## 1. Data Analysis and Features

(1) TGI of marathon runners across Chinese provinces.

(2) BMI comparison between the general population and marathon runners.

The BMI index is computed as BMI = weight (kg) divided by the square of height (m). The Chinese BMI standard: 18.4 and below is underweight; 18.5–23.9 is normal; 24.0–27.9 is mildly overweight; 28.0–29.9 is moderately overweight; above 40.0 is severely overweight.
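As a quick illustration, the classification above can be written as a small function. This is a sketch: the function name is made up, and since the text does not spell out the bands between 30.0 and 40.0, everything above 29.9 is lumped into the last band here.

```python
def bmi_category(weight_kg, height_m):
    """Classify BMI per the bands listed above (Chinese standard)."""
    bmi = weight_kg / height_m ** 2  # BMI = weight (kg) / height (m)^2
    if bmi <= 18.4:
        return bmi, 'underweight'
    elif bmi <= 23.9:
        return bmi, 'normal'
    elif bmi <= 27.9:
        return bmi, 'mildly overweight'
    elif bmi <= 29.9:
        return bmi, 'moderately overweight'
    else:
        # the source leaves 30.0-40.0 unspecified; treated as the top band here
        return bmi, 'severely overweight'

print(bmi_category(70, 1.75))  # a 70 kg, 1.75 m runner
```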

(1) Exploratory data analysis

(2) Feature understanding

(3) Feature preprocessing

(4) Feature selection

(5) Feature transformation

(1) Delete cases that contain missing values

(2) Impute missing values with plausible values
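Both strategies map directly onto pandas; a minimal sketch on made-up data (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fuelle': [30.0, np.nan, 28.5, 31.2],
                   'amount': [200.0, 195.0, np.nan, 210.0]})

# (1) delete cases that contain missing values
dropped = df.dropna()

# (2) impute missing values with a plausible value, e.g. the column mean
imputed = df.fillna(df.mean())
```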

## 3.2. Outliers

### 3.2.3. The outlier detection tool PyOD

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, get_outliers_inliers
# %matplotlib inline  # for Jupyter notebook

# generate random data with two features
X_train, Y_train = generate_data(n_train=200, train_only=True, n_features=2)
# by default the outlier fraction is 0.1 in generate_data
outlier_fraction = 0.1
# store outliers and inliers in different numpy arrays
x_outliers, x_inliers = get_outliers_inliers(X_train, Y_train)
n_inliers = len(x_inliers)
n_outliers = len(x_outliers)
# separate the two features to plot the data
F1 = X_train[:, [0]].reshape(-1, 1)
F2 = X_train[:, [1]].reshape(-1, 1)
# create a meshgrid
xx, yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))
random_state = np.random.RandomState(42)
classifiers = {
    'Angle-based Outlier Detector (ABOD)': ABOD(contamination=outlier_fraction),
    'K Nearest Neighbors (KNN)': KNN(contamination=outlier_fraction),
    'Isolation Forest': IForest(contamination=outlier_fraction, random_state=random_state)
}
# scatter plot of the raw data
plt.scatter(F1, F2)
plt.xlabel('F1')
plt.ylabel('F2')
# set the figure size
plt.figure(figsize=(10, 10))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the model on the dataset
    clf.fit(X_train)
    # raw anomaly score for every point
    scores_pred = clf.decision_function(X_train) * -1
    # predict whether each data point is an outlier or inlier
    y_pred = clf.predict(X_train)
    # number of prediction errors
    n_errors = (y_pred != Y_train).sum()
    print('No of Errors : ', clf_name, n_errors)
    # rest of the code creates the visualization
    # threshold separating inliers from outliers
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
    # raw anomaly score for every point of the meshgrid
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
    subplot = plt.subplot(1, 3, i + 1)
    # fill blue colormap from minimum anomaly score to threshold value
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10),
                     cmap='Blues_r')
    # draw a red contour line where the anomaly score equals the threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange from the threshold to the maximum anomaly score
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    # scatter plot of inliers as white dots
    b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1],
                        c='white', s=20, edgecolor='k')
    # scatter plot of outliers as black dots
    c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1],
                        c='black', s=20, edgecolor='k')
    subplot.axis('tight')
    subplot.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=9),
        loc='lower right')
    subplot.tick_params(labelsize=8)
    subplot.set_title(clf_name, fontsize=10)
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))
plt.show()
```

(1) Delete them

(2) Leave them as-is

(3) Process them
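Option (3) can be done, for example, by clipping values to an interquartile-range fence. This is a common rule of thumb, not one prescribed by the text; the data is made up:

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 35.0])  # 35.0 is an obvious outlier

# 1.5 * IQR fence, a common outlier rule
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned_delete = x[(x >= low) & (x <= high)]  # (1) delete the outlier
cleaned_clip = np.clip(x, low, high)          # (3) process: clip to the fence
```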

## 4. Feature Preprocessing

(1) Categorical features: one-hot encoding, hash encoding, label encoding, count encoding (frequency encoding), label-count encoding, target encoding (binary classification), category embedding, NaN encoding, polynomial encoding, expansion encoding, consolidation encoding, dummy encoding.

(2) Numerical features: rounding, binning, scaling, imputation, interactions, non-linear encoding, row statistics.

(3) Temporal features

(4) Spatial features

(5) Natural language processing

(6) Deep learning / NN

(7) Leakage
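As an example from the categorical list above, count (frequency) encoding replaces each category by how often it occurs, and label encoding assigns each category an integer code. A minimal pandas sketch (the column name and data are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'fuel_type': ['92#', '95#', '92#', '0#', '92#']})

# count encoding: each category is replaced by its number of occurrences
counts = df['fuel_type'].value_counts()
df['fuel_type_count'] = df['fuel_type'].map(counts)

# label encoding: each category gets an integer code (alphabetical order here)
df['fuel_type_label'] = df['fuel_type'].astype('category').cat.codes
```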

## 4.1. Normalization and Standardization

### 4.1.1. Normalization

(1) Min-max normalization (scaling)

$$x' = \frac{x_{i}-x_{min}}{x_{max}-x_{min}}$$

where $x'$ is the output value, in the range $[0,1]$; $x_{i}$ is the input (raw) value; $x_{min}$ is the minimum of the inputs; $x_{max}$ is the maximum of the inputs.
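Plugging numbers into the min-max formula, as a quick numpy sketch:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0])
# x' = (x_i - x_min) / (x_max - x_min)
x_new = (x - x.min()) / (x.max() - x.min())
# the minimum maps to 0, the maximum to 1, everything else in between
```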

(2) Mean normalization

$$x' = \frac{x_{i}-\mu}{x_{max}-x_{min}}$$

where $x'$ is the output value, in the range $[-1,1]$; $x_{i}$ is the input (raw) value; $\mu$ is the mean of the inputs; $x_{min}$ and $x_{max}$ are the minimum and maximum of the inputs.

(3) Non-linear normalization

$$x' = \log_{10}x$$

$$x' = \arctan(x) \times 2/\pi$$
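The two non-linear transforms above, evaluated with numpy as a sketch:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])

log_scaled = np.log10(x)                # x' = log10(x)
atan_scaled = np.arctan(x) * 2 / np.pi  # x' = atan(x) * 2 / pi, in (0, 1) for x > 0
```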

### 4.1.2. Standardization (Z-Score)

$$x' = \frac{x_{i}- \mu }{\sigma}$$

where $x'$ is the new value; $x_{i}$ is the input (raw) value; $\mu$ is the mean of the sample $X_{i}$; $\sigma$ is the standard deviation of $X_{i}$.

### 4.1.3. Centering

$$x' = x - \mu$$

```python
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import matplotlib.pyplot as plt

def get_data():
    cols_name = ['fuelle_date', 'fuelle', 'amount', 'fuel_interva']
    # (the loading of df is omitted in the original)
    df = df[(df['fuelle_date'] > '2019-11-30') & (df['fuelle_date'] < '2020-05-30')]
    df = df[cols_name]
    return df

def contrast_Normalization(df):
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels
    plt.rcParams['axes.unicode_minus'] = False
    y_lim = 750
    # df.plot(kind='density', subplots=True, fontsize=8)
    df.plot(kind='line', x='fuelle_date', subplots=True, fontsize=8, ylim=(0, y_lim))

    # Z-score standardization: (x - mean) / std
    cols = ['fuelle', 'amount', 'fuel_interva']
    df1 = (df[cols] - df[cols].mean()) / df[cols].std()
    df1 = pd.concat([df['fuelle_date'], df1], axis=1)
    y_lim = 12
    df1.plot(kind='line', x='fuelle_date', subplots=True, fontsize=8, ylim=(0, y_lim))

    # min-max normalization: (x - min) / (max - min)
    df2 = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
    df2 = pd.concat([df['fuelle_date'], df2], axis=1)
    y_lim = 1
    df2.plot(kind='line', x='fuelle_date', subplots=True, fontsize=8, ylim=(0, y_lim))
    plt.show()
    return

if __name__ == '__main__':
    df = get_data()
    contrast_Normalization(df)
```

The fit_transform method combines fit and transform: fit_transform(X_train) computes the $\mu$ and $\sigma$ of X_train and then applies them to X_train itself.
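A consequence worth keeping in mind: the scaler should be fit on the training set only, then applied to the test set with transform, so the test data is scaled with the training set's $\mu$ and $\sigma$. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)  # learns mu and sigma from X_train and applies them
X_test_scaled = ss.transform(X_test)        # reuses the training mu and sigma
```

Calling fit_transform on the test set would leak its statistics into the scaling and make train and test incomparable.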

```python
def contrast_Normalization(df):
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels
    plt.rcParams['axes.unicode_minus'] = False
    y_lim = 750
    # df.plot(kind='density', subplots=True, fontsize=8)
    df.plot(kind='line', x='fuelle_date', subplots=True, fontsize=8, ylim=(0, y_lim))

    fuelle = df['fuelle'].values
    amount = df['amount'].values
    fuel_interva = df['fuel_interva'].values
    fuelle_date = df['fuelle_date'].to_list()

    # standardization (Z-score)
    ss = StandardScaler()
    fuelle0 = ss.fit_transform(fuelle.reshape(-1, 1))
    amount0 = ss.fit_transform(amount.reshape(-1, 1))
    fuel_interva0 = ss.fit_transform(fuel_interva.reshape(-1, 1))
    df1 = pd.DataFrame({
        'fuelle_date': fuelle_date,
        'fuelle': fuelle0.ravel().tolist(),
        'amount': amount0.ravel().tolist(),
        'fuel_interva': fuel_interva0.ravel().tolist()})
    y_lim = 12
    df1.plot(kind='line', x='fuelle_date', subplots=True, fontsize=8, ylim=(0, y_lim))

    # min-max normalization
    mm = MinMaxScaler()
    fuelle1 = mm.fit_transform(fuelle.reshape(-1, 1))
    amount1 = mm.fit_transform(amount.reshape(-1, 1))
    fuel_interva1 = mm.fit_transform(fuel_interva.reshape(-1, 1))
    df2 = pd.DataFrame({
        'fuelle_date': fuelle_date,
        'fuelle': fuelle1.ravel().tolist(),
        'amount': amount1.ravel().tolist(),
        'fuel_interva': fuel_interva1.ravel().tolist()})
    y_lim = 1
    df2.plot(kind='line', x='fuelle_date', subplots=True, fontsize=8, ylim=(0, y_lim))

    plt.show()
    return
```

## 4.2. One-Hot Encoding

### 4.2.1. One-hot encoding

```python
data = [['92#汽油', 5.74, 50000],   # 92# gasoline
        ['95#汽油', 6.14, 13000],   # 95# gasoline
        ['0#柴油', 5.21, 30000]]    # 0# diesel
```

After one-hot encoding the fuel-type column, the same data becomes:

```python
data = [[1, 0, 0, 5.74, 50000],
        [0, 1, 0, 6.14, 13000],
        [0, 0, 1, 5.21, 30000]]
```

### 4.2.2. An sklearn example

The price_sensitive data takes the values [0, 1, 2], representing three classes: price-insensitive, somewhat price-sensitive, and price-sensitive.

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

def get_data():
    cols_name = ['price_sensitive', 'fuelle_date', 'fuelle', 'amount', 'fuel_interva']
    # (the loading of df is omitted in the original)
    df = df[(df['fuelle_date'] > '2019-11-30') & (df['fuelle_date'] < '2020-05-30')]
    df = df[cols_name]
    return df

def Process_one_hot(data):
    price_sensitive = data['price_sensitive'].values
    encoder = OneHotEncoder(sparse=False)  # one-hot encoding
    ans = encoder.fit_transform(price_sensitive.reshape((-1, 1)))
    # inverse_transform decodes the result back into the original codes,
    # so the final output can be mapped back to the original labels
    print(encoder.inverse_transform(ans))
    return ans

if __name__ == '__main__':
    data = get_data()
    label = Process_one_hot(data)
    print(label)
```

## 4.3. Data Transformation

Log transform: $x' = \ln(x)$.

Box-Cox transform, a method that automatically searches for the transformation that best normalizes the data:

$$x' = \left\{\begin{matrix} \frac{x^{\lambda }-1}{\lambda} & \lambda \neq 0\\ \ln(x) & \lambda = 0 \end{matrix}\right.$$
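scipy provides this search directly: scipy.stats.boxcox returns both the transformed data and the $\lambda$ it found. A sketch on made-up right-skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed, strictly positive

x_transformed, best_lambda = stats.boxcox(x)
# for lognormal data the best lambda is close to 0, i.e. the transform is close to ln(x)
```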

```python
import numpy as np

# data transformation; df = get_data() as in the earlier code
def Data_Transformation(df):
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels
    plt.rcParams['axes.unicode_minus'] = False

    df = df[['amount']]
    df.plot(kind='hist', subplots=True, fontsize=8)

    df['log_amount'] = np.log(1 + df['amount'].values)  # the +1 guards against log(0)
    df1 = df[['log_amount']]
    df1.plot(kind='hist', subplots=True, fontsize=8)

    plt.show()
```

References:

- 《使用sklearn做单机特征工程》
- 《中国马拉松跑者研究蓝皮书》
- 《数据缺失值的4种处理方法》
- 《几种常见的离群点检验方法》
- 《了解离群值以及如何使用Python中的PyOD检测离群值》, Python Free, February 2020
- 《sklearn 中文文档》
- 《特征预处理》
- 《Python sklearn决策树算法实践》, CSDN blog, Xiao Yongwei, April 2018
- 《Pandas（数据表）深入应用经验小结（查询、分组、上下行间计算等）》, CSDN blog, Xiao Yongwei, August 2020
- 《特征工程入门与实践》, translated by Zhuang Jiasheng, Posts & Telecom Press, 2019