### Feature Preprocessing for Numerical Data

```
count    148654.000000
mean      74768.321972
std       50517.005274
min        -618.130000
25%       36168.995000
50%       71426.610000
75%      105839.135000
max      567595.430000
Name: TotalPay, dtype: float64
```

### 1. Feature Scaling (Normalization)

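Rescaling (Min-Max normalization): this is the simplest form of normalization. Each value is mapped to (x - min) / (max - min), so the feature ends up in the range [0, 1].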

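```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
# Scalers expect a 2-D array, hence the reshape to a single column.
total_pay = df["TotalPay"].to_numpy().reshape(-1, 1)
df["TotalPay"] = scaler.fit_transform(total_pay).ravel()
df.hist(figsize=(15, 6), column=["TotalPay"])
```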

```
count    148654.000000
mean          0.132673
std           0.088905
min           0.000000
25%           0.064742
50%           0.126792
75%           0.187354
max           1.000000
Name: TotalPay, dtype: float64
```
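For reference, the rescaling formula (x - min) / (max - min) can be written out by hand. The following is a minimal NumPy sketch on a small made-up array; the helper name `min_max_scale` and the sample values are illustrative assumptions, not part of the dataset above.

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant feature has no spread; return zeros instead of dividing by zero.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Hypothetical pay values, chosen only for illustration.
pay = np.array([-500.0, 0.0, 2000.0, 4500.0])
scaled = min_max_scale(pay)
```

Because the minimum and maximum alone determine the mapping, a single extreme value compresses everything else toward zero, which is visible in the summary above (the 75th percentile sits at about 0.19).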

Standardization (Z-score normalization): here the feature is rescaled so that it has zero mean and unit variance.

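```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
# Scalers expect a 2-D array, hence the reshape to a single column.
total_pay = df["TotalPay"].to_numpy().reshape(-1, 1)
df["TotalPay"] = scaler.fit_transform(total_pay).ravel()
df.hist(figsize=(15, 6), column=["TotalPay"])
```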

```
count    1.486540e+05
mean    -2.707706e-15
std      1.000003e+00
min     -1.492304e+00
25%     -7.640884e-01
50%     -6.615046e-02
75%      6.150586e-01
max      9.755700e+00
Name: TotalPay, dtype: float64
```
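The `std` of 1.000003 rather than exactly 1 in the summary above is consistent with a difference in conventions: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). Below is a small NumPy sketch of the z-score formula z = (x - mean) / std and of that discrepancy; the helper name `z_score` and the input values are made up for illustration.

```python
import numpy as np

def z_score(values):
    """Standardize a 1-D array: subtract the mean, divide by the population std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=0)

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
z = z_score(x)

population_std = z.std(ddof=0)  # the quantity that standardization sets to 1
sample_std = z.std(ddof=1)      # slightly larger, by a factor of sqrt(n / (n - 1))
```

With n = 148654 that factor is roughly 1.000003, so the two conventions are practically indistinguishable on a dataset of this size.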

### 2. Outlier Removal

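```python
import numpy as np
import pandas as pd

x = df["TotalPay"]
# The 1st and 99th percentiles serve as the lower and upper caps.
lower_bound, upper_bound = np.percentile(x, [1, 99])
y = np.clip(x, lower_bound, upper_bound)  # values beyond the caps are set to the caps
pd.DataFrame(y).hist()
```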

```
count    148654.000000
mean      74502.923277
std       49644.336571
min         286.971000
25%       36168.995000
50%       71426.610000
75%      105839.135000
max      207015.797400
Name: TotalPay, dtype: float64
```
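Note that `np.clip` caps extreme values at the bounds rather than dropping them, which is why `count` stays at 148654 in the summary above. The sketch below contrasts capping with actually removing rows. It uses synthetic data, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=70_000, scale=20_000, size=1_000)  # synthetic "pay" values

lower, upper = np.percentile(x, [1, 99])

# Option A: clip (winsorize). Every row is kept; extremes are pulled to the bounds.
clipped = np.clip(x, lower, upper)

# Option B: filter. Rows outside the bounds are actually removed.
mask = (x >= lower) & (x <= upper)
filtered = x[mask]
```

Which option fits depends on the downstream use: clipping preserves the row count (useful when other columns must stay aligned), while filtering discards the rows entirely.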

### 4. Log Transformation

`df['LogTotalPay'] = np.log(1+df.TotalPay)`

```
count    148653.000000
mean         10.739295
std           1.413888
min           0.000000
25%          10.495993
50%          11.176446
75%          11.569692
max          13.249166
Name: LogTotalPay, dtype: float64
```
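The `count` here is 148653, one fewer than before. This is presumably because the minimum `TotalPay` (-618.13) makes `1 + x` negative, and the logarithm of a negative number is NaN, which `describe()` excludes. A minimal sketch of one way to guard against this follows. It assumes that flooring negative pay at zero is acceptable for the analysis, which is an assumption and not something stated above; the sample values are borrowed from the summary statistics purely for illustration.

```python
import numpy as np

# Illustrative values borrowed from the summary statistics above.
pay = np.array([-618.13, 0.0, 36168.995, 567595.43])

# Naive transform: log(1 + x) is undefined for x < -1 and yields NaN.
with np.errstate(invalid="ignore"):
    naive = np.log1p(pay)

# Guarded transform: floor at zero first (an assumption about how to treat negative pay).
guarded = np.log1p(np.clip(pay, 0, None))
```

`np.log1p(x)` computes log(1 + x) and is numerically more accurate than `np.log(1 + x)` for values of x close to zero.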