## Introduction

### Different Machine Learning Models

Logistic regression. Logistic regression is a supervised machine learning algorithm that models the probability of a class or event. It is typically used for binary classification tasks when the data is linearly separable. At the core of the algorithm is the logistic function (also called the sigmoid function), which takes any real number and maps it to a value between 0 (exclusive) and 1 (exclusive). For this reason, logistic regression's predicted y values stay between 0 and 1, whereas linear regression's predictions can fall outside that range.
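As a quick illustration of that mapping, here is a minimal sketch of the sigmoid function (the `sigmoid` helper below is our own, not part of any library):

```python
import numpy as np

def sigmoid(z):
    # maps any real-valued input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 - the decision boundary
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```

No matter how large or small `z` gets, the output never leaves (0, 1), which is exactly what makes it usable as a probability.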

## Installation and Setup

- Pandas
- NumPy
- Scikit-Learn
- XGBoost

```
pip install pandas
pip install numpy
pip install scikit-learn
pip install xgboost
```

```
from warnings import filterwarnings
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from xgboost import XGBClassifier
```

## Reading the Data

`df = pd.read_csv("train.csv")`

`df.head()`

```
y = df["Survived"]
X = df.drop(["Survived", "PassengerId", "Name"], axis=1)  # drop the columns we don't need in X
```

## Splitting the Data

`X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)`
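To make the 80/20 split concrete, here is a small sketch on made-up data (the toy arrays below stand in for the Titanic rows):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=0)

print(len(X_tr), len(X_va))  # 8 2
```

Fixing `random_state` makes the shuffle reproducible, so every run (and every preprocessing variant later in this article) sees the same train/validation rows.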

## Handling Missing Values

```
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
print(cols_with_missing)
```

`['Age', 'Cabin', 'Embarked']`

## Option 1: Dropping Missing Values

```
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
```

## Option 2: Imputation

```
# create a list of numerical columns
numerical_cols = X_train.select_dtypes(include=np.number).columns.tolist()
# create a list of categorical columns
s = (X_train.dtypes == 'object')
categorical_cols = list(s[s].index)
# create a copy of the original data so that none of the original data is changed
imputed_X_train = X_train.copy()
imputed_X_valid = X_valid.copy()
# create two imputers - one for the numerical columns, and the other for the categorical columns
numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="most_frequent")
# use the imputers and update the training data
imputed_X_train[numerical_cols] = numerical_imputer.fit_transform(imputed_X_train[numerical_cols])
imputed_X_train[categorical_cols] = categorical_imputer.fit_transform(imputed_X_train[categorical_cols])
# use the imputers and update the validation data
imputed_X_valid[numerical_cols] = numerical_imputer.transform(imputed_X_valid[numerical_cols])
imputed_X_valid[categorical_cols] = categorical_imputer.transform(imputed_X_valid[categorical_cols])
```
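Note the asymmetry above: `fit_transform` on the training data, plain `transform` on the validation data. A minimal sketch on a hypothetical miniature `Age` column shows why this prevents leakage:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy stand-ins for the Age column (not the real Titanic data)
train_toy = pd.DataFrame({"Age": [20.0, 40.0, np.nan]})
valid_toy = pd.DataFrame({"Age": [np.nan]})

imputer = SimpleImputer(strategy="mean")
# fit_transform: the mean (30.0) is computed from the training data only
train_toy[["Age"]] = imputer.fit_transform(train_toy[["Age"]])
# transform: the training mean is reused, so no validation statistics leak in
valid_toy[["Age"]] = imputer.transform(valid_toy[["Age"]])

print(valid_toy["Age"].iloc[0])  # 30.0
```

If we had called `fit_transform` on the validation data too, the fill value would depend on the validation set itself, which inflates the validation score.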

## Handling Categorical Variables

```
# Get list of categorical variables
s = (imputed_X_train.dtypes == 'object')
categorical_cols = list(s[s].index)
print("Categorical variables:")
print(categorical_cols)
```

```
Categorical variables:
['Sex', 'Ticket', 'Cabin', 'Embarked']
```

```
for feature in categorical_cols:
    print(f"{feature}: {len(imputed_X_train[feature].unique())}")
```

```
Sex: 2
Ticket: 569
Cabin: 127
Embarked: 3
```

```
# delete the high-cardinality "Ticket" and "Cabin" columns from both the training and validation data
imputed_X_train.drop(["Ticket", "Cabin"], axis=1, inplace=True)
imputed_X_valid.drop(["Ticket", "Cabin"], axis=1, inplace=True)
reduced_X_train.drop("Ticket", axis=1, inplace=True)  # Cabin was already dropped along with the missing values
reduced_X_valid.drop("Ticket", axis=1, inplace=True)  # Cabin was already dropped along with the missing values
```

## Dropping Categorical Features

```
def drop_categorical_variables(train_df, valid_df):
    drop_train = train_df.select_dtypes(exclude=['object'])
    drop_valid = valid_df.select_dtypes(exclude=['object'])
    return drop_train, drop_valid
```

`drop_X_train, drop_X_valid = drop_categorical_variables(imputed_X_train, imputed_X_valid)`

`print(list(drop_X_train.columns))`

`['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']`

## Ordinal Encoding

`print(list(imputed_X_train["Embarked"].unique()))`

`['C', 'S', 'Q']`

```
def ordinal_encode(train_df, val_df):
    s = (train_df.dtypes == 'object')
    categorical_cols = list(s[s].index)
    # Apply the ordinal encoder to each column with categorical data
    ordinal_encoded_X_train = train_df.copy()
    ordinal_encoded_X_valid = val_df.copy()
    ordinal_encoder = OrdinalEncoder()
    ordinal_encoded_X_train[categorical_cols] = ordinal_encoder.fit_transform(train_df[categorical_cols])
    ordinal_encoded_X_valid[categorical_cols] = ordinal_encoder.transform(val_df[categorical_cols])

    return ordinal_encoded_X_train, ordinal_encoded_X_valid
```

`ordinal_encoded_X_train, ordinal_encoded_X_valid = ordinal_encode(imputed_X_train, imputed_X_valid)`

`print(list(ordinal_encoded_X_train["Embarked"].unique()))`

`[0.0, 2.0, 1.0]`
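The mapping may look arbitrary, but by default `OrdinalEncoder` sorts the categories alphabetically ('C' → 0, 'Q' → 1, 'S' → 2), which a small sketch on toy data confirms:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"Embarked": ["C", "S", "Q"]})
enc = OrdinalEncoder()  # categories are sorted alphabetically by default
print(enc.fit_transform(toy).ravel())  # [0. 2. 1.]
print(list(enc.categories_[0]))        # ['C', 'Q', 'S']
```

Keep in mind that this imposes an order ('C' < 'Q' < 'S') that has no real meaning for a port of embarkation; tree-based models tolerate this well, while linear models may not.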

## One-Hot Encoding

```
def one_hot_encode(train_df, validation_df):
    s = (train_df.dtypes == 'object')
    categorical_cols = list(s[s].index)
    # handle_unknown='ignore' avoids errors when the validation data contains classes not seen in the training data
    # sparse_output=False returns a NumPy array instead of a sparse matrix
    # (the parameter was named `sparse` before scikit-learn 1.2)
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    # create new dataframes for the one-hot encoded features
    OH_cols_train = pd.DataFrame(encoder.fit_transform(train_df[categorical_cols]))
    OH_cols_valid = pd.DataFrame(encoder.transform(validation_df[categorical_cols]))
    # One-hot encoding removed the index, so we put it back here.
    # This is needed to combine the one-hot encoded columns with the numerical columns; otherwise alignment issues arise
    OH_cols_train.index = train_df.index
    OH_cols_valid.index = validation_df.index

    # fix the column names
    OH_cols_train.columns = encoder.get_feature_names_out()
    OH_cols_valid.columns = encoder.get_feature_names_out()

    # remove the original categorical columns
    num_X_train = train_df.drop(categorical_cols, axis=1)
    num_X_valid = validation_df.drop(categorical_cols, axis=1)
    # combine the encoded features with the numerical features
    OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
    OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

    return OH_X_train, OH_X_valid
```

`OH_X_train, OH_X_valid = one_hot_encode(imputed_X_train, imputed_X_valid)`

## Support Vector Machine (SVM)

```
svm_clf = svm.LinearSVC()
svm_clf.fit(OH_X_train, y_train)
```

## Logistic Regression

Logistic regression predicts the probability of an output class by fitting the data to a logit function.
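A minimal sketch of that idea on made-up data (the toy arrays below are purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)
# predict_proba returns one probability per class; each row sums to 1
proba = model.predict_proba(X_toy)
print(proba.shape)  # (4, 2)
```

This is what distinguishes it from a plain classifier interface: `predict` gives the class label, while `predict_proba` exposes the underlying probabilities from the sigmoid.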

```
lr = LogisticRegression()
lr.fit(OH_X_train, y_train)
```

## Random Forest

```
clf = RandomForestClassifier()
clf.fit(OH_X_train, y_train)
```

## Testing, Results, and Interpreting the Results

```
# make sure none of the dataframes contains the "Ticket" or "Cabin" columns - if they do, uncomment the code below
# imputed_X_train.drop(["Ticket", "Cabin"], axis=1, inplace=True)
# imputed_X_valid.drop(["Ticket", "Cabin"], axis=1, inplace=True)
# reduced_X_train.drop(["Ticket", "Cabin"], axis=1, inplace=True)
# reduced_X_valid.drop(["Ticket", "Cabin"], axis=1, inplace=True)
missing_values_handled_dfs = [
    {
        "description": "Dropped Missing Values",
        "train": reduced_X_train,
        "validation": reduced_X_valid
    },
    {
        "description": "Imputed",
        "train": imputed_X_train,
        "validation": imputed_X_valid
    }
]
```

```
dfs_to_be_used = []  # list of dictionaries - each dict will contain a description, the training dataframe, and the validation dataframe
for df_dict in missing_values_handled_dfs:
    train_df = df_dict["train"]
    validation_df = df_dict["validation"]
    description = df_dict["description"]

    # drop categorical features
    drop_X_train, drop_X_valid = drop_categorical_variables(train_df, validation_df)
    dfs_to_be_used.append({
        "description": description + ", dropped categorical features",
        "train": drop_X_train,
        "validation": drop_X_valid
    })

    # ordinal encoding
    ordinal_encoded_X_train, ordinal_encoded_X_valid = ordinal_encode(train_df, validation_df)
    dfs_to_be_used.append({
        "description": description + ", ordinal encoding",
        "train": ordinal_encoded_X_train,
        "validation": ordinal_encoded_X_valid
    })

    # one-hot encoding
    OH_X_train, OH_X_valid = one_hot_encode(train_df, validation_df)
    dfs_to_be_used.append({
        "description": description + ", one-hot encoding",
        "train": OH_X_train,
        "validation": OH_X_valid
    })
```

`filterwarnings('ignore')`

The `y_train` and `y_valid` used below are the ones created at the start of this article, when we split the data into 80% training and 20% validation. The loop below prints the results for each model.

```
for df in dfs_to_be_used:
    print(df["description"] + ":")

    svm_clf = svm.LinearSVC()
    svm_clf.fit(df["train"], y_train)
    print("\t" + "SVM: " + str(accuracy_score(y_valid, svm_clf.predict(df["validation"]))))

    lr = LogisticRegression()
    lr.fit(df["train"], y_train)
    print("\t" + "Logistic Regression: " + str(accuracy_score(y_valid, lr.predict(df["validation"]))))

    rf = RandomForestClassifier()
    rf.fit(df["train"], y_train)
    print("\t" + "Random Forest: " + str(accuracy_score(y_valid, rf.predict(df["validation"]))))
```

```
Dropped Missing Values, dropped categorical features:
    SVM: 0.7262569832402235
    Logistic Regression: 0.7150837988826816
    Random Forest: 0.7150837988826816
Dropped Missing Values, ordinal encoding:
    SVM: 0.776536312849162
    Logistic Regression: 0.7988826815642458
    Random Forest: 0.8324022346368715
Dropped Missing Values, one-hot encoding:
    SVM: 0.6759776536312849
    Logistic Regression: 0.7988826815642458
    Random Forest: 0.8379888268156425
Imputed, dropped categorical features:
    SVM: 0.6983240223463687
    Logistic Regression: 0.7374301675977654
    Random Forest: 0.6927374301675978
Imputed, ordinal encoding:
    SVM: 0.664804469273743
    Logistic Regression: 0.7988826815642458
    Random Forest: 0.8324022346368715
Imputed, one-hot encoding:
    SVM: 0.8100558659217877
    Logistic Regression: 0.8044692737430168
    Random Forest: 0.8435754189944135
```