We'll follow the general machine learning workflow step-by-step in Part Two:

1. Data cleaning and formatting
2. Exploratory data analysis
3. Feature engineering and selection
4. Hiding and filtering sensitive features (data anonymization)
5. Compare several machine learning models on a performance metric
6. Perform hyperparameter tuning on the best model
7. Evaluate the best model on the testing set
8. Interpret the model results
9. Draw conclusions and document work

Extracted from:

- A Complete Machine Learning Walk-Through in Python, Part One [1]
- A Complete Machine Learning Walk-Through in Python, Part Two [2]
- A Complete Machine Learning Walk-Through in Python, Part Three [3]

## Model Evaluation and Selection

- Linear Regression
- K-Nearest Neighbors Regression
- Random Forest Regression
- Gradient Boosted Regression
- Support Vector Machine Regression

```python
import pandas as pd
import numpy as np
```

```
Training Feature Size: (6622, 64)
Testing Feature Size:  (2839, 64)
Training Labels Size:  (6622, 1)
Testing Labels Size:   (2839, 1)
```

(We have to do the imputation this way — fitting only on the training data rather than on all of the data — to avoid the problem of test data leakage, where information from the test dataset spills over into the training data.)

```python
from sklearn.preprocessing import Imputer

# Create an imputer object with a median filling strategy
imputer = Imputer(strategy='median')

# Train on the training features
imputer.fit(train_features)

# Transform both training data and testing data
X = imputer.transform(train_features)
X_test = imputer.transform(test_features)
```

```
Missing values in training features: 0
Missing values in testing features:  0
```

```python
from sklearn.preprocessing import MinMaxScaler

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X)

# Transform both the training and testing data
X = scaler.transform(X)
X_test = scaler.transform(X_test)
```

### Implementing Machine Learning Models in Scikit-Learn

```python
from sklearn.ensemble import GradientBoostingRegressor

# Create the model
gradient_boosted = GradientBoostingRegressor()

# Fit the model on the training data
gradient_boosted.fit(X, y)

# Make predictions on the test data
predictions = gradient_boosted.predict(X_test)

# Evaluate the model
mae = np.mean(abs(predictions - y_test))
print('Gradient Boosted Performance on the test set: MAE = %0.4f' % mae)
```
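The same fit/predict/evaluate pattern applies to all five models listed above, so they can be compared in a single loop. A minimal sketch on synthetic data — the `make_regression` dataset and the specific model settings here are illustrative assumptions, not the article's own data or configuration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Synthetic stand-in data; the article uses its own building-energy dataset
features, labels = make_regression(n_samples=500, n_features=10,
                                   noise=10, random_state=42)
X, X_test, y, y_test = train_test_split(features, labels,
                                        test_size=0.3, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=10),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=60),
    'Gradient Boosted': GradientBoostingRegressor(random_state=60),
    'Support Vector Machine': SVR(C=1000, gamma=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X, y)                      # train on the training split
    predictions = model.predict(X_test)  # predict on the held-out split
    results[name] = np.mean(abs(predictions - y_test))  # mean absolute error
    print('%s MAE = %0.4f' % (name, results[name]))
```

Comparing all candidates on the same held-out split with the same metric (MAE) is what makes the model selection in this section fair.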

## Hyperparameter Tuning for Model Optimization

- Model hyperparameters are best thought of as settings of a machine learning algorithm that the data scientist chooses before training — for example, the number of trees in a Random Forest, or the number of neighbors used in K-nearest neighbors.
- Model parameters are what the model learns during training, such as the weights in linear regression. Controlling the hyperparameters affects model performance by changing the balance between underfitting and overfitting [8] in the model. Underfitting means our model is not complex enough (it does not have enough degrees of freedom) to learn the mapping from features to target. An underfit model has high bias [9], which we can correct by making the model more complex.

The idea of K-Fold cross validation with K = 5 is that the training data is split into five folds, and the model is trained and evaluated five times, each time holding out a different fold for validation.
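What those K = 5 splits look like can be sketched with scikit-learn's `KFold` on a toy ten-sample array (the data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples stand in for the training data
data = np.arange(10)

# K = 5: each sample lands in the validation fold exactly once
kf = KFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
    print('Fold %d: train=%s, validation=%s' % (fold, train_idx, val_idx))
```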

1. Set up a grid of hyperparameters to evaluate
2. Randomly sample a combination of hyperparameters
3. Create a model with the selected combination
4. Evaluate the model using K-Fold cross validation
5. Determine which hyperparameters worked best

- loss: the loss function to minimize
- n_estimators: the number of weak learners (decision trees) to use
- max_depth: the maximum depth of each decision tree
- min_samples_leaf: the minimum number of examples required at a leaf node of a decision tree
- min_samples_split: the minimum number of examples required to split a node of a decision tree
- max_features: the maximum number of features to consider when splitting a node

```python
# Loss function to be optimized
# ('ls', 'lad', 'huber' are loss options of GradientBoostingRegressor)
loss = ['ls', 'lad', 'huber']

# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]

# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}
```

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Create the model to use for hyperparameter tuning
model = GradientBoostingRegressor(random_state=42)

# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               cv=4, n_iter=25,
                               scoring='neg_mean_absolute_error',
                               n_jobs=-1, verbose=1,
                               return_train_score=True,
                               random_state=42)

# Fit on the training data
random_cv.fit(X, y)

# Find the best combination of settings
random_cv.best_estimator_
```

```
max_features=None, min_samples_leaf=6,
min_samples_split=6,
n_estimators=500)
```

## Evaluating on the Test Set

```python
# Make predictions on the test set using the default and final model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)
```

```
Default model performance on the test set: MAE = 10.0118.
Final model performance on the test set:   MAE = 9.0446.
```
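The gain from tuning can be read off these two numbers directly; a quick check of the relative improvement:

```python
default_mae = 10.0118  # MAE of the default model, from above
final_mae = 9.0446     # MAE of the tuned final model, from above

# Percent reduction in mean absolute error achieved by hyperparameter tuning
improvement = 100 * (default_mae - final_mae) / default_mae
print('Hyperparameter tuning reduced the test-set MAE by %0.2f%%' % improvement)
```

That improvement of roughly ten percent comes at a training-time cost, as the timing runs below show.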

```python
%%timeit -n 1 -r 5
default_model.fit(X, y)
```

```
1.09 s ± 153 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

```python
%%timeit -n 1 -r 5
final_model.fit(X, y)
```

```
12.1 s ± 1.33 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

- Imputing missing values and scaling features
- Evaluating and comparing several machine learning models
- Hyperparameter tuning using random grid search and cross validation
- Evaluating the best model on the test set

#### References

- Hands-On Machine Learning with Scikit-Learn and Tensorflow [20] (Jupyter Notebooks for this book are available online for free!)
- An Introduction to Statistical Learning [21]
- Kaggle [22]: The Home of Data Science and Machine Learning
- Datacamp [23]: Good beginner tutorials for practicing data science coding
- Coursera [24]: Free and paid courses in many subjects
- Udacity [25]: Paid programming and data science courses

#### Reference Links

`[1]` A Complete Machine Learning Walk-Through in Python, Part One: https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-one-c62152f39420

`[2]` A Complete Machine Learning Walk-Through in Python, Part Two: https://towardsdatascience.com/a-complete-machine-learning-project-walk-through-in-python-part-two-300f1f8147e2

`[3]` A Complete Machine Learning Walk-Through in Python, Part Three: https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-three-388834e8804b

`[4]` On GitHub: https://github.com/WillKoehrsen/machine-learning-project-walkthrough

`[5]` An Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/

`[6]` Hands-On Machine Learning with Scikit-Learn and TensorFlow: http://shop.oreilly.com/product/0636920052289.do

`[7]` What are hyperparameters, and how do they differ from parameters: https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/

`[8]` Underfitting and overfitting: https://towardsdatascience.com/overfitting-vs-underfitting-a-conceptual-explanation-d94ee20ca7f9

`[9]` High bias: https://en.wikipedia.org/wiki/Bias–variance_tradeoff

`[10]` High variance: https://en.wikipedia.org/wiki/Bias–variance_tradeoff

`[11]` TPOT from the Epistasis Lab: https://epistasislab.github.io/tpot/

`[12]` Genetic programming: https://en.wikipedia.org/wiki/Genetic_programming

`[13]` Bagging algorithms such as random forest: https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/

`[14]` Gradient boosting method: http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/

`[15]` XGBoost: https://xgboost.readthedocs.io/en/latest/model.html

`[16]` The law of diminishing returns in machine learning: http://www.picnet.com.au/blogs/guido/2018/04/13/diminishing-returns-machine-learning-projects/

`[17]` Bias versus variance: https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/

`[18]` Available here: https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-three-388834e8804b

`[19]` @koehrsen_will: https://twitter.com/koehrsen_will

`[20]` Hands-On Machine Learning with Scikit-Learn and Tensorflow (Jupyter Notebooks for this book are available online for free!): http://shop.oreilly.com/product/0636920052289.do

`[21]` An Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/

`[22]` Kaggle: https://www.kaggle.com/

`[23]` Datacamp: https://www.datacamp.com/

`[24]` Coursera: https://www.coursera.org/

`[25]` Udacity: https://www.udacity.com/