——Andrew Ng

## 0x02 Aggregate Feature Construction

```python
display(df.head(10))
# Output:
#   C1 C2  N1    N2
# 0  A  a   1   1.1
# 1  A  a   1   2.2
# 2  A  a   2   3.3
# 3  B  a   2   4.4
# 4  B  a   3   5.5
# 5  C  b   3   6.6
# 6  C  b   4   7.7
# 7  C  b   4   8.8
# 8  D  b   5   9.9
# 9  D  b   5  10.0
```
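The original does not show how `df` was created. A minimal sketch, assuming the ten displayed rows are the whole dataset, rebuilds it so the later snippets can be run as-is:

```python
import pandas as pd

# Rebuild the 10-row example table shown above (values copied from the output).
df = pd.DataFrame({
    "C1": ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
    "C2": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
    "N1": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "N2": [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10.0],
})
print(df.head(10))
```

Here `C1` and `C2` are the categorical columns and `N1` and `N2` are the numeric ones.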

### 1. Group-wise statistical features

#### median(N1)_by(C1)

```python
df.groupby(['C1']).agg({'N1': 'median'})
# Output:
#      N1
# C1
# A   1.0
# B   2.5
# C   4.0
# D   5.0
```

#### mean(N1)_by(C1)

```python
df.groupby(['C1']).agg({'N1': 'mean'})
# Output:
#           N1
# C1
# A   1.333333
# B   2.500000
# C   3.666667
# D   5.000000
```

#### mode(N1)_by(C1)

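```python
# Most frequent N1 value per group; on ties the smallest value is returned.
# Series.mode() is used instead of stats.mode(x)[0][0], because newer SciPy
# releases return a scalar mode by default, which breaks the [0][0] indexing.
df.groupby(['C1'])['N1'].agg(lambda x: x.mode().iloc[0])
# Output:
# C1
# A    1
# B    2
# C    4
# D    5
```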

#### min(N1)_by(C1)

```python
df.groupby(['C1']).agg({'N1': 'min'})
# Output:
#     N1
# C1
# A    1
# B    2
# C    3
# D    5
```

#### max(N1)_by(C1)

```python
df.groupby(['C1']).agg({'N1': 'max'})
# Output:
#     N1
# C1
# A    2
# B    3
# C    4
# D    5
```

#### std(N1)_by(C1)

```python
# Sample standard deviation (pandas uses ddof=1 by default)
df.groupby(['C1']).agg({'N1': 'std'})
# Output:
#           N1
# C1
# A   0.577350
# B   0.707107
# C   0.577350
# D   0.000000
```

#### var(N1)_by(C1)

```python
# Sample variance (pandas uses ddof=1 by default)
df.groupby(['C1']).agg({'N1': 'var'})
# Output:
#           N1
# C1
# A   0.333333
# B   0.500000
# C   0.333333
# D   0.000000
```

#### freq(C2)_by(C1)

```python
# Number of non-null C2 values in each C1 group
df.groupby(['C1']).agg({'C2': 'count'})
# Output:
#     C2
# C1
# A    3
# B    2
# C    3
# D    2
```
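The individual statistics above can also be computed in a single pass and attached to each row as features. The following is a sketch rather than the author's method: the `stat(N1)_by(C1)` column names simply mirror the headings above, and the data is the example table reduced to the two columns needed.

```python
import pandas as pd

# Example data copied from the table at the top of this section (C1 and N1 only).
df = pd.DataFrame({
    "C1": ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
    "N1": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

# One groupby call producing several statistics per C1 group.
stats_by_c1 = df.groupby("C1")["N1"].agg(["median", "mean", "min", "max", "std", "var"])

# Rename columns to the stat(N1)_by(C1) pattern used in the headings.
stats_by_c1.columns = [f"{stat}(N1)_by(C1)" for stat in stats_by_c1.columns]

# Broadcast the group-level values back to every row of the original frame.
df_feat = df.join(stats_by_c1, on="C1")
print(df_feat.head())
```

Joining on `C1` keeps the original row order and index, so the new columns line up with the rows they describe.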

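### 2. Frequency-count features

freq(C1): count the frequency of a categorical feature directly. This is meaningful even without a groupby.

```python
# count() returns the number of non-null values in C1
df['C1'].count()
# Output: 10
```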
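Note that `count()` returns the total number of non-null rows, not a per-category frequency. If the goal is a row-level frequency feature, one common option is `value_counts()` followed by `map()`. A sketch under that assumption (the column name `freq(C1)` is made up for illustration):

```python
import pandas as pd

# C1 column copied from the example table.
df = pd.DataFrame({"C1": ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"]})

# How often each category occurs in the column.
freq = df["C1"].value_counts()

# Map every row's category to its frequency (frequency encoding).
df["freq(C1)"] = df["C1"].map(freq)
print(df)
```

Each row now carries the number of times its own category appears, which a model can use directly as a numeric feature.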

### 3. Combining group statistics with basic feature-engineering methods

#### Group median combined with linear combinations

N1 + median(N1)_by(C1)

N1 - median(N1)_by(C1)

N1 * median(N1)_by(C1)

N1 / median(N1)_by(C1)

```python
# Per-group median of N1, merged back onto every row via C1
n1_median = (
    df.groupby(['C1'])['N1'].median()
      .reset_index()
      .rename(columns={'N1': 'N1_Median'})
)
df = pd.merge(df, n1_median, on='C1', how='left')
df['N1+Median(C1)'] = df['N1'] + df['N1_Median']
df['N1-Median(C1)'] = df['N1'] - df['N1_Median']
df['N1*Median(C1)'] = df['N1'] * df['N1_Median']
df['N1/Median(C1)'] = df['N1'] / df['N1_Median']
# Output:
#    C1  C2  N1    N2  N1_Median  N1+Median(C1)  N1-Median(C1)  N1*Median(C1)  N1/Median(C1)
# 0   A   a   1   1.1        1.0            2.0            0.0            1.0           1.00
# 1   A   a   1   2.2        1.0            2.0            0.0            1.0           1.00
# 2   A   a   2   3.3        1.0            3.0            1.0            2.0           2.00
# 3   B   a   2   4.4        2.5            4.5           -0.5            5.0           0.80
# 4   B   a   3   5.5        2.5            5.5            0.5            7.5           1.20
# 5   C   b   3   6.6        4.0            7.0           -1.0           12.0           0.75
# 6   C   b   4   7.7        4.0            8.0            0.0           16.0           1.00
# 7   C   b   4   8.8        4.0            8.0            0.0           16.0           1.00
# 8   D   b   5   9.9        5.0           10.0            0.0           25.0           1.00
# 9   D   b   5  10.0        5.0           10.0            0.0           25.0           1.00
```

#### Group mean combined with linear combinations

N1 + mean(N1)_by(C1)

N1 - mean(N1)_by(C1)

N1 * mean(N1)_by(C1)

N1 / mean(N1)_by(C1)

```python
# Per-group mean of N1, merged back onto every row via C1
n1_mean = (
    df.groupby(['C1'])['N1'].mean()
      .reset_index()
      .rename(columns={'N1': 'N1_Mean'})
)
df = pd.merge(df, n1_mean, on='C1', how='left')
df['N1+Mean(C1)'] = df['N1'] + df['N1_Mean']
df['N1-Mean(C1)'] = df['N1'] - df['N1_Mean']
df['N1*Mean(C1)'] = df['N1'] * df['N1_Mean']
df['N1/Mean(C1)'] = df['N1'] / df['N1_Mean']
# Output:
#    C1  C2  N1    N2   N1_Mean  N1+Mean(C1)  N1-Mean(C1)  N1*Mean(C1)  N1/Mean(C1)
# 0   A   a   1   1.1  1.333333     2.333333    -0.333333     1.333333     0.750000
# 1   A   a   1   2.2  1.333333     2.333333    -0.333333     1.333333     0.750000
# 2   A   a   2   3.3  1.333333     3.333333     0.666667     2.666667     1.500000
# 3   B   a   2   4.4  2.500000     4.500000    -0.500000     5.000000     0.800000
# 4   B   a   3   5.5  2.500000     5.500000     0.500000     7.500000     1.200000
# 5   C   b   3   6.6  3.666667     6.666667    -0.666667    11.000000     0.818182
# 6   C   b   4   7.7  3.666667     7.666667     0.333333    14.666667     1.090909
# 7   C   b   4   8.8  3.666667     7.666667     0.333333    14.666667     1.090909
# 8   D   b   5   9.9  5.000000    10.000000     0.000000    25.000000     1.000000
# 9   D   b   5  10.0  5.000000    10.000000     0.000000    25.000000     1.000000
```
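The median and mean variants differ only in the statistic used, so the pattern can be wrapped in a small helper. This is a sketch, not the author's code: `add_group_combos` is a hypothetical name, and it uses `groupby(...).transform(...)` in place of the merge shown above to get a row-aligned group statistic.

```python
import pandas as pd

def add_group_combos(df, group_col, value_col, stat):
    """Hypothetical helper: combine value_col with its per-group statistic."""
    out = df.copy()
    # transform() returns one value per row, aligned with the original index.
    group_stat = out.groupby(group_col)[value_col].transform(stat)
    name = f"{stat}({value_col})_by({group_col})"
    out[f"{value_col}+{name}"] = out[value_col] + group_stat
    out[f"{value_col}-{name}"] = out[value_col] - group_stat
    out[f"{value_col}*{name}"] = out[value_col] * group_stat
    out[f"{value_col}/{name}"] = out[value_col] / group_stat
    return out

# Example data copied from the table at the top of this section.
df = pd.DataFrame({
    "C1": ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
    "N1": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})
result = add_group_combos(df, "C1", "N1", "median")
print(result)
```

Passing `"mean"` instead of `"median"` should reproduce the second table. Division will yield `inf` or `NaN` if a group statistic is zero, which this sketch does not handle.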

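## 0x03 Simple Transformation Features

### 1. Adding, subtracting, multiplying, or dividing a single column by a constant

#### Implementation

```python
# n is any numeric constant; each line below is an alternative transformation
df['Feature'] = df['Feature'] + n
df['Feature'] = df['Feature'] - n
df['Feature'] = df['Feature'] * n
df['Feature'] = df['Feature'] / n
```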

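### 2. Monotonic transformation of a single column

#### Implementation

```python
import numpy as np
# Raise to the n-th power (here n = 2)
df['Feature'] = df['Feature']**2
# Log transform (requires strictly positive values)
df['Feature'] = np.log(df['Feature'])
```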
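What makes these transformations monotonic is that they preserve the ordering of the values and change only their spacing. A small sketch with made-up data illustrates this; `np.log1p` (log of 1 + x) is used here as an assumption, because the sample contains a zero, where `np.log` would return `-inf`.

```python
import numpy as np
import pandas as pd

# Made-up, non-negative sample values spanning several orders of magnitude.
s = pd.Series([0.0, 1.0, 10.0, 100.0])

# log1p compresses large values while keeping zero at zero.
logged = np.log1p(s)

# The ranks before and after the transform are identical: order is preserved.
same_order = (s.rank() == logged.rank()).all()
print(logged.round(4).tolist(), same_order)
```

Squaring is only monotonic on non-negative inputs, so it is worth checking the sign of a column before applying `**2` for this purpose.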

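### 3. Linear combination

[A x B]: a feature formed by multiplying the values of two features.

[A x B x C x D x E]: a feature formed by multiplying the values of five features.

[A x A]: a feature formed by squaring a single feature.

#### Implementation

```python
# [A x B]
df['Feature'] = df['A'] * df['B']
# [A x B x C x D x E]
df['Feature'] = df['A'] * df['B'] * df['C'] * df['D'] * df['E']
# [A x A]
df['Feature'] = df['A'] * df['A']
```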

### 4. Polynomial features

#### Implementation

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)
print(X)
# Output:
# [[0 1]
#  [2 3]
#  [4 5]]

# Set the polynomial degree to 2
poly = PolynomialFeatures(2)
print(poly.fit_transform(X))
# Output:
# [[ 1.  0.  1.  0.  0.  1.]
#  [ 1.  2.  3.  4.  6.  9.]
#  [ 1.  4.  5. 16. 20. 25.]]

# The default degree is 2; additionally set interaction_only=True
poly = PolynomialFeatures(interaction_only=True)
print(poly.fit_transform(X))
# Output:
# [[ 1.  0.  1.  0.]
#  [ 1.  2.  3.  6.]
#  [ 1.  4.  5. 20.]]
```
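To see where each output column comes from, the degree-2 expansion for two input features `(a, b)` can be written out by hand as `[1, a, b, a^2, a*b, b^2]`. The NumPy-only sketch below reproduces the first output above for this two-feature example; `poly_features_degree2` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import combinations_with_replacement

import numpy as np

def poly_features_degree2(X):
    """Hypothetical helper: bias, linear terms, then all degree-2 products."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    cols = [np.ones(n_samples)]                    # bias column of ones
    cols += [X[:, i] for i in range(n_features)]   # degree-1 terms: a, b
    # Degree-2 terms: every pair (i, j) with i <= j, i.e. a*a, a*b, b*b.
    cols += [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(n_features), 2)]
    return np.column_stack(cols)

X = np.arange(6).reshape(3, 2)
expanded = poly_features_degree2(X)
print(expanded)
```

Dropping the `a*a` and `b*b` columns leaves `[1, a, b, a*b]`, which is what `interaction_only=True` returns in the second output.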

### 5. Ratio features

#### Implementation

`df['Feature'] = df['X1']/df['X2']`
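A ratio becomes `inf` or `NaN` wherever the denominator is zero. One possible guard, shown as a sketch with made-up data, is to turn zero denominators into missing values before dividing. Whether `NaN` is the right fill depends on the downstream model.

```python
import numpy as np
import pandas as pd

# Made-up data; the second row has a zero denominator.
df = pd.DataFrame({"X1": [1.0, 2.0, 3.0], "X2": [2.0, 0.0, 4.0]})

# Replace zeros with NaN so the division yields NaN instead of inf.
df["Feature"] = df["X1"] / df["X2"].replace(0, np.nan)
print(df)
```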

### 6. Absolute-value features

#### Implementation

```python
import numpy as np
df['Feature'] = np.abs(df['Feature'])
```

### 7. Maximum-value features

#### Implementation

```python
# Row-wise maximum of X1 and X2
df['Feature'] = df.apply(lambda x: max(x['X1'], x['X2']), axis=1)
```

### 8. Minimum-value features

#### Implementation

```python
# Row-wise minimum of X1 and X2
df['Feature'] = df.apply(lambda x: min(x['X1'], x['X2']), axis=1)
```
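The row-wise `apply` calls in the maximum and minimum sections run a Python function per row. A vectorized alternative, sketched here with made-up data and illustrative column names, selects the columns and reduces along `axis=1`:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"X1": [1, 5, 3], "X2": [4, 2, 3]})

# Row-wise maximum and minimum across the selected columns.
df["max_X1_X2"] = df[["X1", "X2"]].max(axis=1)
df["min_X1_X2"] = df[["X1", "X2"]].min(axis=1)
print(df)
```

One behavioral difference to keep in mind: the pandas reductions skip `NaN` by default, whereas Python's built-in `max` and `min` do not treat `NaN` specially.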

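### 9. Rank-encoding features

#### Implementation

```python
import pandas as pd

X = [10, 9, 9, 8, 7]
df = pd.DataFrame({'X': X})
# Dense ranking in descending order: ties share a rank and no ranks are skipped
df['num'] = df['X'].rank(ascending=False, method='dense')
# Output:
#     X  num
# 0  10  1.0
# 1   9  2.0
# 2   9  2.0
# 3   8  3.0
# 4   7  4.0
```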
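The `method` argument controls how ties are ranked, which changes the resulting feature. The sketch below compares three of the options pandas offers on the same data; the column names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"X": [10, 9, 9, 8, 7]})

# dense: ties share a rank, the next rank is not skipped.
df["dense"] = df["X"].rank(ascending=False, method="dense")
# min: ties share the lowest rank of the tied group, later ranks are skipped.
df["min"] = df["X"].rank(ascending=False, method="min")
# average (the pandas default): ties get the mean of the ranks they span.
df["average"] = df["X"].rank(ascending=False, method="average")
print(df)
```

With the two tied 9s, `dense` gives 2 and 2 followed by 3, `min` gives 2 and 2 followed by 4, and `average` gives 2.5 and 2.5 followed by 4.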

### 10. XOR features

#### Implementation

```python
# Bitwise XOR of X1 and X2 (both columns must hold integers or booleans)
df['Feature'] = df.apply(lambda x: x['X1'] ^ x['X2'], axis=1)
```
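For integer columns the same result can be obtained without `apply`, since the `^` operator works element-wise on pandas Series. A short sketch with made-up integer data:

```python
import pandas as pd

# Made-up integer data: 1 = 0b01, 2 = 0b10, 3 = 0b11.
df = pd.DataFrame({"X1": [1, 2, 3], "X2": [3, 2, 1]})

# Element-wise bitwise XOR of the two columns.
df["Feature"] = df["X1"] ^ df["X2"]
print(df)
```

Here `1 ^ 3` is `2`, `2 ^ 2` is `0`, and `3 ^ 1` is `2`. Float columns would need casting to an integer dtype first.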

## 0x0FF Summary
