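## 1. Importing the Standard Libraries

- numpy: provides many of the mathematical routines that machine learning relies on
- matplotlib.pyplot: used mainly for plotting
- pandas: used to import datasets and to process them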

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```

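## 2. Importing the Dataset

`iloc` takes its arguments as `[rows, columns]`: the part to the left of the comma selects rows, the part to the right selects columns, and a bare colon selects all rows or all columns.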
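As a small illustration of this indexing rule, the sketch below slices a made-up four-column DataFrame with `iloc`. The name `demo`, the column names, and the values are invented for the example and are not the dataset used in these notes.

```python
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# [rows, columns]: ":" keeps every row, ":-1" keeps every column except the last.
features = demo.iloc[:, :-1].values
# ":" keeps every row, 3 picks the column at position 3 (the fourth column).
target = demo.iloc[:, 3].values

print(features.shape)  # (3, 3)
print(target)          # ['No' 'Yes' 'No']
```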

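```python
# Import the dataset
# `dataset` is assumed to be a pandas DataFrame loaded beforehand, e.g. with pd.read_csv(...)
X = dataset.iloc[:, :-1].values  # every column except the last: the features
y = dataset.iloc[:, 3].values    # the column at position 3: the target
```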

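## 3. Missing Data

The `Imputer` class handles missing data. Its `axis` parameter selects the direction along which the fill value is computed:

- axis = 0: impute along columns (for example, use the column mean)
- axis = 1: impute along rows (for example, use the row mean)

The `strategy` parameter selects the statistic:

- If "mean", then replace missing values using the mean along the axis.
- If "median", then replace missing values using the median along the axis.
- If "most_frequent", then replace missing values using the most frequent value along the axis.

Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22. Its replacement, `sklearn.impute.SimpleImputer`, has no `axis` parameter and always imputes column by column.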
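To make the three strategies concrete, here is a minimal sketch that applies each of them to a made-up array. It assumes a scikit-learn version that ships `sklearn.impute.SimpleImputer` (0.20 or later); the array `data` and the dictionary `fill_values` are invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: one missing value in each column.
data = np.array([
    [1.0, 10.0],
    [1.0, np.nan],
    [np.nan, 10.0],
    [7.0, 40.0],
])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputer.fit(data)
    # statistics_ holds the per-column fill value learned by fit
    fill_values[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_)
```

For this data the column means are 3 and 20, while both the medians and the most frequent values are 1 and 10, so the choice of strategy visibly changes what gets filled in.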

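```python
# Taking care of missing data
# Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
# and always works column by column (the old axis=0).
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])  # columns 1 and 2 (the slice end is exclusive)
X[:, 1:3] = imputer.transform(X[:, 1:3])
```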

## 4. Categorical Data

### 4.2. One-Hot Encoding (Dummy Coding)

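```python
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# One-hot encode column 0 and pass the other columns through unchanged.
# (OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer replaces it, and a separate LabelEncoder step for X is no longer needed.)
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])],
                       remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)
# LabelEncoder encodes labels with values between 0 and n_classes-1
# (it converts the class names into integers)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
```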
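The difference between label encoding and one-hot encoding is easier to see on a tiny example. The sketch below uses a made-up single-column array (`countries` and its values are placeholders) and prints both encodings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up categorical column; the country names are placeholders.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

# LabelEncoder: one integer per category, assigned in sorted order.
labels = LabelEncoder().fit_transform(countries.ravel())
print(labels)  # [0 2 1 2]

# OneHotEncoder: one 0/1 column per category, so no artificial order is implied.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(countries).toarray()
print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(one_hot)
```

The integer labels suggest that Spain (2) is somehow "greater" than France (0), which is why a feature column is usually one-hot encoded, while plain integer labels are fine for the target `y`.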

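## 5. Splitting the Dataset into a Training Set and a Test Set

- `test_size`: as a float between 0 and 1, the fraction of samples placed in the test set. It defaults to 0.25 when `train_size` is also left unset; 0.2 or 0.25 is a common choice.
- `random_state`: seeds the random number generator used for shuffling, so a fixed value makes the split reproducible.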
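A short sketch of what these two parameters do, using made-up arrays (`X_demo` and `y_demo` are invented for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 puts 20% of the samples (2 of 10) in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# The same random_state reproduces exactly the same split.
_, _, _, y_te_again = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
print(np.array_equal(y_te, y_te_again))  # True
```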

```python
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

## 6. Feature Scaling

### 6.1. Standardisation

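```python
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn mean and std from the training set
X_test = sc_X.transform(X_test)        # reuse the training statistics on the test set
```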
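As a sanity check on what standardisation computes, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std on made-up data. The arrays `train` and `test` are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])  # made-up training column
test = np.array([[2.5], [5.0]])                 # made-up test column

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))  # True
print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The first test value, 2.5, equals the training mean, so it maps to 0 after scaling.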

### 6.2. Normalisation

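```python
import sklearn.preprocessing as sp
# raw_samples is assumed to be a 2-D numeric array defined earlier
mms = sp.MinMaxScaler(feature_range=(0, 1))  # rescale each column to [0, 1]
mms_samples2 = mms.fit_transform(raw_samples)
```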
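For comparison with standardisation, here is a minimal sketch of min-max normalisation on made-up data, checked against the formula (x - min) / (max - min). The array `samples` is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up samples: two features on very different scales.
samples = np.array([[1.0, 100.0], [2.0, 300.0], [5.0, 500.0]])

mms_demo = MinMaxScaler(feature_range=(0, 1))
scaled = mms_demo.fit_transform(samples)

# Same result by hand: (x - min) / (max - min), computed per column
col_min = samples.min(axis=0)
col_max = samples.max(axis=0)
manual = (samples - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))  # True
```

After scaling, every column runs from 0 to 1 regardless of its original range.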

## 7. Data Preprocessing Template

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the dataset
# `dataset` is assumed to be a pandas DataFrame loaded beforehand, e.g. with pd.read_csv(...)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
# (SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# (ColumnTransformer replaces OneHotEncoder's removed categorical_features argument)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])],
                       remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```

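## 8. Questions

### 8.1. What is the difference between fit, fit_transform, and transform?

- `fit()`: computes the statistics of the training set X, such as its mean, variance, maximum, and minimum. It can be thought of as the training step.
- `transform()`: uses the statistics learned by `fit()` to apply the actual operation, such as standardisation, dimensionality reduction, or normalisation.
- `fit_transform()`: combines `fit()` and `transform()`, so it both learns the statistics and applies the transformation.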

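1. Call `fit_transform(trainData)` first, and only then `transform(testData)`.
2. Calling `transform(testData)` directly, without a prior fit, raises an error.
3. If you call `fit_transform(testData)` instead of `transform(testData)` after `fit_transform(trainData)`, the test data is still scaled, but with statistics computed from the test set itself. The two results are then not on the same scale and can differ noticeably.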
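The three points above can be sketched with `StandardScaler` on made-up data. The arrays `train_data` and `test_data` are invented, and the test values are deliberately much larger than the training values so that the difference is easy to see.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler

train_data = np.array([[1.0], [2.0], [3.0]])    # made-up training data
test_data = np.array([[10.0], [20.0], [30.0]])  # made-up test data

# Calling transform on a scaler that was never fitted raises NotFittedError.
try:
    StandardScaler().transform(test_data)
    raised = False
except NotFittedError:
    raised = True
print(raised)  # True

# Correct order: fit on the training data, then reuse those statistics on the test data.
scaler = StandardScaler()
scaler.fit_transform(train_data)
consistent = scaler.transform(test_data)

# Refitting on the test data uses the test set's own mean and std instead.
refit = StandardScaler().fit_transform(test_data)

print(consistent.ravel())  # large values: test data measured on the training scale
print(refit.ravel())       # centred on 0, which hides the shift between the two sets
```

With `transform`, the test values come out far above the training range, which is accurate because they really are much larger. With a second `fit_transform`, the test set is re-centred around 0 and the information that it differs from the training set is lost.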