## Qualities of good features

### Avoid rarely used discrete feature values

`unique_house_id: 8SK982ZZ1242Z`

### Prefer clear and obvious meanings

`house_age: 27`

`house_age: 851472000`

`user_age: 277`

### Don't mix special ("magic") values with actual data

`quality_rating: 0.82`

`quality_rating: 0.37`

`quality_rating: -1`
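One common way to keep a sentinel such as `-1` out of the numeric range is to split the feature in two: the rating itself, and a Boolean recording whether a rating was supplied. The sketch below assumes `-1` means "no rating"; the helper `split_quality_rating`, the name `is_quality_rating_defined`, and the `0.0` placeholder are illustrative assumptions rather than anything stated above.

```python
from typing import Tuple

MISSING = -1  # assumed sentinel meaning "no rating was supplied"


def split_quality_rating(raw: float) -> Tuple[float, bool]:
    """Return (quality_rating, is_quality_rating_defined).

    The placeholder 0.0 is arbitrary; a model should rely on the
    Boolean flag, not the placeholder, to detect a missing rating.
    """
    if raw == MISSING:
        return 0.0, False
    return raw, True


for raw in (0.82, 0.37, -1):
    print(split_quality_rating(raw))
```

With this split, the numeric feature stays inside its real range and the "missing" case is carried by a separate feature.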

### Account for upstream instability

`city_id: "br/sao_paulo"`

`inferred_city_cluster: "219"`

### Scale feature values

`scaled_value = (value - mean) / stddev`

`scaled_value = (130 - 100) / 20`

`scaled_value = 1.5`
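The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `z_score` is hypothetical, and the mean of 100 and standard deviation of 20 are simply the numbers used in the worked example.

```python
def z_score(value: float, mean: float, stddev: float) -> float:
    """Return how many standard deviations `value` lies from `mean`."""
    if stddev == 0:
        raise ValueError("stddev must be non-zero")
    return (value - mean) / stddev


scaled_value = z_score(130, mean=100, stddev=20)
print(scaled_value)  # 1.5
```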

### Scrubbing

Does the number of examples with `country:uk` match what you expect?
Should `language:jp` really be the most common language in your dataset?
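Questions like these can often be answered by counting how frequently each value occurs. The sketch below uses `collections.Counter` on a small made-up list; the `languages` sample is purely hypothetical and stands in for a real column of the dataset.

```python
from collections import Counter

# Hypothetical sample of a "language" feature column.
languages = ["en", "jp", "en", "es", "en", "jp"]

counts = Counter(languages)

# Inspect the most frequent values and compare them with expectations.
for value, count in counts.most_common():
    print(f"language:{value} -> {count} examples")
```

If the top value is not the one you expected, that is a cue to check how the data was collected or labeled.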

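## Feature crosses: encoding nonlinearity

### Kinds of feature crosses

[A X B]: a feature cross formed by multiplying the values of two features.

[A x B x C x D x E]: a feature cross formed by multiplying the values of five features.

[A x A]: a feature cross formed by squaring a single feature.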
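All three kinds are products of feature values, so one small helper covers them. A minimal sketch, assuming numeric features; the function name `feature_cross` and the sample values are hypothetical.

```python
from math import prod


def feature_cross(*features: float) -> float:
    """Multiply the given feature values into one synthetic feature."""
    if not features:
        raise ValueError("at least one feature value is required")
    return prod(features)


a, b = 2.0, 3.0
print(feature_cross(a, b))                        # [A X B] -> 6.0
print(feature_cross(1.0, 2.0, 3.0, 4.0, 5.0))     # five features -> 120.0
print(feature_cross(a, a))                        # [A x A] -> 4.0
```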

## Feature crosses: crossing one-hot vectors

`country:usa AND language:spanish`

`binned_latitude = [0, 0, 0, 1, 0]`

`binned_longitude = [0, 1, 0, 0, 0]`

`binned_latitude X binned_longitude`
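Crossing two one-hot vectors can be viewed as taking their outer product and flattening it: with five bins each, the result has 25 elements, exactly one of which is 1. The sketch below uses the two vectors shown above; the helper name `cross_one_hot` is hypothetical.

```python
from typing import List


def cross_one_hot(a: List[int], b: List[int]) -> List[int]:
    """Flattened outer product: element i * len(b) + j equals a[i] * b[j]."""
    return [x * y for x in a for y in b]


binned_latitude = [0, 0, 0, 1, 0]
binned_longitude = [0, 1, 0, 0, 0]

crossed = cross_one_hot(binned_latitude, binned_longitude)
print(len(crossed))      # 25
print(crossed.index(1))  # 16, i.e. latitude bin 3 and longitude bin 1
```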

```
binned_latitude(lat) = [
  0  < lat <= 10
  10 < lat <= 20
  20 < lat <= 30
]

binned_longitude(lon) = [
  0  < lon <= 15
  15 < lon <= 30
]

binned_latitude_X_longitude(lat, lon) = [
  0  < lat <= 10 AND 0  < lon <= 15
  0  < lat <= 10 AND 15 < lon <= 30
  10 < lat <= 20 AND 0  < lon <= 15
  10 < lat <= 20 AND 15 < lon <= 30
  20 < lat <= 30 AND 0  < lon <= 15
  20 < lat <= 30 AND 15 < lon <= 30
]
```
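The bin definitions above can be turned into a function that maps a raw `(lat, lon)` pair to a six-element one-hot vector, ordered the same way as the listing. This is a sketch under the bin edges shown; the names `LAT_EDGES`, `LON_EDGES`, and `bin_index` are hypothetical, and values outside the listed ranges are rejected rather than guessed at.

```python
from typing import List, Sequence

LAT_EDGES = [0, 10, 20, 30]  # bins: (0, 10], (10, 20], (20, 30]
LON_EDGES = [0, 15, 30]      # bins: (0, 15], (15, 30]


def bin_index(value: float, edges: Sequence[float]) -> int:
    """Return i such that edges[i] < value <= edges[i + 1]."""
    for i in range(len(edges) - 1):
        if edges[i] < value <= edges[i + 1]:
            return i
    raise ValueError(f"{value} falls outside the defined bins")


def binned_latitude_x_longitude(lat: float, lon: float) -> List[int]:
    """One-hot vector over every (latitude bin, longitude bin) pair."""
    n_lat = len(LAT_EDGES) - 1
    n_lon = len(LON_EDGES) - 1
    one_hot = [0] * (n_lat * n_lon)
    one_hot[bin_index(lat, LAT_EDGES) * n_lon + bin_index(lon, LON_EDGES)] = 1
    return one_hot


print(binned_latitude_x_longitude(12.0, 20.0))  # [0, 0, 0, 1, 0, 0]
```

The single 1 sits at index 3, which corresponds to `10 < lat <= 20 AND 15 < lon <= 30` in the listing.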

`[behavior type X time of day]`