Preface

KNN (k-nearest neighbors), also called the K-nearest-neighbor algorithm, is one of the simpler and easier-to-understand algorithms in machine learning, and it is a model that can produce predictions with almost no training.

1. Theoretical Foundations of KNN

1.2 Working Principle and Characteristics

The K-nearest-neighbor algorithm works as follows: given a labeled training set, for each new sample to be classified, compute its distance to every training sample, take the K training samples closest to it, and assign the new sample the label that occurs most often among those K neighbors (a majority vote).

The characteristics of KNN follow directly from this. Because a prediction is based on K nearby reference points, the algorithm achieves high accuracy, is insensitive to outliers, makes no assumptions about the input data, and works for both numerical and nominal data. Its drawbacks are high computational complexity and high space complexity, since the entire training set must be stored and scanned at prediction time.
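The steps above can be sketched from scratch in a few lines of NumPy (a minimal illustration with toy data, not a production implementation):

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    """Classify one sample by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training sample
    dists = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two clusters, around (0, 0) with label 0 and around (5, 5) with label 1
x_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(x_train, y_train, np.array([0.5, 0.5]), k=3))  # 0
print(knn_predict(x_train, y_train, np.array([5.5, 5.5]), k=3))  # 1
```

Note that there is no training step at all: "fitting" is just keeping the training set around, which is exactly why KNN trades training cost for prediction cost.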

1.4 Distance Calculation

When some coordinates are missing, `nan_euclidean_distances` computes the squared distance over the coordinates present in both points and rescales by the ratio of total to present coordinates: dist(x, y) = sqrt(n_total / n_present × Σ (xᵢ − yᵢ)²), summing only over coordinates present in both. For example, dist([1, 2, np.nan], [3, 4, 3]) = sqrt(3/2 × ((1−3)² + (2−4)²)) ≈ 3.46. `KNNImputer` then fills each missing value with the mean of that feature over the nearest neighbors. With the four samples [1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7] and 2 neighbors:

| Sample with missing value | Nearest neighbor (distance) | Second neighbor (distance) | After imputation |
|---|---|---|---|
| [1, 2, np.nan] | [3, 4, 3] (3.46) | [np.nan, 6, 5] (6.93) | [1, 2, 4] |
| [np.nan, 6, 5] | [3, 4, 3] (3.46) | [8, 8, 7] (3.46) | [5.5, 6, 5] |
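The distances and imputed values above can be reproduced directly with scikit-learn:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

# Pairwise distances that skip missing coordinates and rescale the rest
print(nan_euclidean_distances(X).round(2))

# Fill each missing value with the mean of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```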

2. Code Practice

2.1 Iris Case Study

```python
import numpy as np
import matplotlib.pyplot as plt
# Import the KNN classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
```

```python
# Load the iris dataset
# iris is a Bunch object containing data (the iris features) and target (the class labels)
iris = datasets.load_iris()
# Separate the samples from the labels
x = iris['data']
y = iris['target']
print(x.shape, y.shape)  # (150, 4) (150,)
# Split the dataset 8:2 into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# (120, 4) (30, 4) (120,) (30,)
```

```python
# p=2 with the Minkowski metric is the ordinary Euclidean distance
clf = KNeighborsClassifier(n_neighbors=5, p=2, metric="minkowski")
clf.fit(x_train, y_train)  # fit can loosely be thought of as just storing the training table
# KNeighborsClassifier()
```

```python
y_predict = clf.predict(x_test)
y_predict.shape  # (30,)
# Accuracy: the fraction of test samples predicted correctly
acc = sum(y_predict == y_test) / y_test.shape[0]
acc
```
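The accuracy depends on the choice of `n_neighbors`. A small sketch (using the same iris data as above, with a fixed `random_state` so the run is reproducible) that scores a few candidate values of k on a held-out split:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.2, random_state=0)

# score() returns the same accuracy we computed by hand above
for k in (1, 3, 5, 7, 9):
    clf = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    print(k, clf.score(x_test, y_test))
```

In practice the best k is usually picked by cross-validation rather than a single split.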

2.2 Horse Colic Case Study

```
# Download the dataset (Jupyter shell command)
!wget https://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/3K/horse-colic.csv
```

```python
import numpy as np
import pandas as pd
# kNN classifier
from sklearn.neighbors import KNeighborsClassifier
# kNN imputation of missing values
from sklearn.impute import KNNImputer
# Euclidean distance that tolerates missing values
from sklearn.metrics.pairwise import nan_euclidean_distances
# Cross-validation
from sklearn.model_selection import cross_val_score
# Repeated stratified K-fold splitter
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
```

```python
temp = pd.read_csv('horse-colic.csv', header=None)
temp  # column 23 indicates whether there was a lesion: 1 = yes, 2 = no
```

```python
# na_values='?' tells pandas to parse '?' as NaN
df = pd.read_csv('horse-colic.csv', header=None, na_values='?')
df
```

```python
data = df.values  # the raw data has 300 rows and 28 columns
x_index = [i for i in range(data.shape[1]) if i != 23]
x, y = data[:, x_index], data[:, 23]  # pull out the lesion column as the label
print(x.shape, y.shape)  # (300, 27) (300,)
cols_null = []
for i in x_index:
    cols_null.append(df[i].isnull().sum())  # number of missing values in each feature column

cols_null  # [1, 0, 0, 60, 24, 58, ...]
```

```python
imputer = KNNImputer()
# Fill the missing values in the dataset
x1 = imputer.fit_transform(x)  # fit and transform can also be called separately
print(sum(np.isnan(x1)))  # no missing values remain after imputation
print(sum(np.isnan(x)))
# [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [  1   0   0  60  24  58  56  69  47  32  55  44  56 104 106 247 102 118  29  33 165 198   1   0   0   0   0]
```

```python
# Chain the imputer and the classifier in a pipeline
pipe = Pipeline(steps=[('imputer', KNNImputer(n_neighbors=5)),
                       ('model', KNeighborsClassifier())])
# Split 8:2 into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# Fit and evaluate the model
pipe.fit(x_train, y_train)
score = pipe.score(x_test, y_test)
score  # 0.8166
```
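The imports above bring in `cross_val_score` and `RepeatedStratifiedKFold`, but the section never actually uses them. A sketch of how they combine with the pipeline; synthetic data is used here so the example is self-contained (on the horse-colic data you would pass the `x` and `y` built above instead):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the horse-colic features: the label depends on
# the first two features, and ~10% of entries are knocked out as missing
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = (x[:, 0] + x[:, 1] > 0).astype(int)
x[rng.random(x.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[('imputer', KNNImputer(n_neighbors=5)),
                       ('model', KNeighborsClassifier())])
# 5 folds repeated 3 times -> 15 accuracy scores; because imputation lives
# inside the pipeline, it is refit on each training fold, so no information
# leaks from the validation fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(pipe, x, y, scoring='accuracy', cv=cv)
print(scores.mean(), scores.std())
```

Averaging over repeated folds gives a steadier accuracy estimate than the single 8:2 split above.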

References

*Machine Learning in Action* (《机器学习实战》)

https://developer.aliyun.com/ai/scenario/febc2223e46f419dae84df47b1760ffc