#### 1. We use the iris dataset, a classic dataset from the UCI Machine Learning Repository. We can load it directly and do some initial exploration:

```
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

# Load the iris dataset bundled with scikit-learn
iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X.shape)
print(y.shape)
```

```
(150, 4)
(150,)
```

#### Method 1

```
# Method 1
# np.concatenate requires the input arrays to match along the non-concatenation
# axes, so the label vector is reshaped with reshape(-1, 1): -1 lets NumPy infer
# the number of rows, 1 fixes a single column. axis=1 concatenates column-wise.
tempConcat = np.concatenate((X, y.reshape(-1, 1)), axis=1)
# Shuffle the combined array in place
np.random.shuffle(tempConcat)
# Split the shuffled array back into features (first 4 columns) and labels
shuffle_X, shuffle_y = np.split(tempConcat, [4], axis=1)
# Set the test ratio
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
X_train = shuffle_X[test_size:]
y_train = shuffle_y[test_size:]
X_test = shuffle_X[:test_size]
y_test = shuffle_y[:test_size]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

```
(120, 4)
(30, 4)
(120, 1)
(30, 1)
```
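One side effect of Method 1 worth noting: because the merged array must have a single dtype, the integer labels are upcast to float and have to be cast back after the split. A minimal sketch with toy data:

```python
import numpy as np

# Merging features and labels into one array forces a single dtype,
# so integer labels are upcast to float.
X = np.arange(12, dtype=float).reshape(6, 2)   # toy feature matrix
y = np.array([0, 0, 1, 1, 2, 2])               # integer class labels

merged = np.concatenate((X, y.reshape(-1, 1)), axis=1)
print(merged.dtype)  # float64 -- the labels are no longer integers

# After splitting, cast the labels back if integer classes are needed
_, labels = np.split(merged, [2], axis=1)
labels = labels.ravel().astype(np.int64)
print(labels.dtype)  # int64
```

This is why `y_train` and `y_test` come out with shape `(n, 1)` above rather than `(n,)`.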

#### Method 2

```
# Method 2
# np.random.permutation(len(X)) returns the numbers 0..len(X)-1 in random
# order. Note: the elements are shuffled indices, not the data itself.
shuffle_index = np.random.permutation(len(X))
# Set the test ratio
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_index = shuffle_index[:test_size]
train_index = shuffle_index[test_size:]
# Fancy indexing selects the corresponding rows
X_train = X[train_index]
X_test = X[test_index]
y_train = y[train_index]
y_test = y[test_index]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

```
(120, 4)
(30, 4)
(120,)
(30,)
```
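The key property that makes Method 2 correct is that the test and train index sets partition the full index range: disjoint, and together covering every sample exactly once. A quick sanity check (using a fixed seed so the run is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for a reproducible shuffle
n = 150
shuffle_index = rng.permutation(n)

test_size = int(n * 0.2)
test_index = shuffle_index[:test_size]
train_index = shuffle_index[test_size:]

# Disjoint: no sample appears in both halves
assert set(test_index).isdisjoint(set(train_index))
# Complete: together the two halves cover all n indices
assert sorted(np.concatenate([test_index, train_index])) == list(range(n))
print(len(train_index), len(test_index))  # 120 30
```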

#### 3. Writing our own train_test_split

```
def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into X_train, X_test, y_train, y_test by test_ratio."""
    assert X.shape[0] == y.shape[0], "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be valid"
    if seed is not None:
        np.random.seed(seed)
    shuffle_index = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_index = shuffle_index[:test_size]
    train_index = shuffle_index[test_size:]
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

Running this gives:

```
(120, 4)
(30, 4)
(120,)
(30,)
```
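A purely random split can, by chance, leave one class under-represented in the test set. A hedged sketch of a stratified variant (the helper name `stratified_train_test_split` is my own, not from the original post) that shuffles indices within each class so every class keeps roughly the same proportion in both halves:

```python
import numpy as np

def stratified_train_test_split(X, y, test_ratio=0.2, seed=None):
    """Illustrative stratified split: shuffle within each class separately."""
    rng = np.random.default_rng(seed)
    test_index, train_index = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])  # shuffled indices of this class
        cut = int(len(idx) * test_ratio)
        test_index.extend(idx[:cut])
        train_index.extend(idx[cut:])
    train_index, test_index = np.array(train_index), np.array(test_index)
    return X[train_index], X[test_index], y[train_index], y[test_index]

# Toy data: 3 classes with 10 samples each
y = np.repeat([0, 1, 2], 10)
X = np.arange(30).reshape(-1, 1)
X_train, X_test, y_train, y_test = stratified_train_test_split(X, y, seed=42)
print(np.bincount(y_test))  # each class contributes exactly 2 of its 10 samples
```

sklearn's `train_test_split` offers the same behavior via its `stratify` parameter.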

#### 4. train_test_split in sklearn

```
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```

Running this gives:

```
(120, 4)
(30, 4)
(120,)
(30,)
```

### 2. Classification accuracy

`accuracy_score` computes the classification accuracy: by default it returns the fraction of correctly classified samples, or the raw count when `normalize=False`. For multilabel classification it returns the subset accuracy: a sample contributes 1.0 only if its predicted label set matches the true label set exactly, and 0.0 otherwise.
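A small illustration of the three behaviors described above (the values here are made up for demonstration):

```python
from sklearn.metrics import accuracy_score
import numpy as np

y_true = np.array([0, 1, 2, 2])
y_pred = np.array([0, 1, 1, 2])

print(accuracy_score(y_true, y_pred))                   # 0.75 -- fraction correct
print(accuracy_score(y_true, y_pred, normalize=False))  # 3 correct predictions

# Multilabel case: a row counts only if the whole label set matches exactly
y_true_ml = np.array([[1, 1], [0, 1]])
y_pred_ml = np.array([[1, 1], [1, 1]])
print(accuracy_score(y_true_ml, y_pred_ml))             # 0.5 (only row 0 matches)
```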

```
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
a = accuracy_score(y_test, y_predict)
print(y_predict)
# score computes the accuracy directly, without materializing y_predict
print(knn_clf.score(X_test, y_test))
```

Running this gives:

`1.0`
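A score of 1.0 simply means every test sample was classified correctly. For single-label classification, accuracy reduces to a one-line NumPy expression:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    # y_true == y_pred is a boolean array; its mean is the accuracy
    return np.mean(y_true == y_pred)

y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 2, 2, 0])
print(accuracy(y_true, y_pred))  # 0.8
```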

### 3. Hyperparameters

#### 3.2 Finding a good k

```
# Track the best score, initialized to 0.0, and the best k, initialized to -1
best_score = 0.0
best_k = -1
for k in range(1, 11):  # search k in the range 1..10 for now
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k = ", best_k)
print("best_score = ", best_score)
```
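One caveat: if `best_k` lands on the edge of the range (here, 10), the search range should be widened. It also helps to record the score for every k rather than only the maximum, so the k-vs-accuracy curve can be inspected, e.g. with the matplotlib import from the beginning. A self-contained sketch:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666)

# Record the test accuracy for every k instead of only the best one
scores = []
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    scores.append(knn_clf.score(X_test, y_test))

best_k = 1 + scores.index(max(scores))  # first k achieving the best score
print(best_k, max(scores))
# plt.plot(range(1, 11), scores) would visualize the whole curve
```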

#### 3.3 Another hyperparameter: weights

```
# Compare the two weighting schemes
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method, p=2)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print("best_method = ", best_method)
print("best_k = ", best_k)
print("best_score = ", best_score)
```

#### 3.4 Hyperparameter grid search

```
param_search = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]
```

```
knn_clf = KNeighborsClassifier()
# Import the grid search utility
from sklearn.model_selection import GridSearchCV
# Construct the grid_search object: the first argument is the estimator to
# search over, the second is the parameter grid to search
grid_search = GridSearchCV(knn_clf, param_search)
```

`print(grid_search.fit(X_train, y_train))`

```
# best_estimator_ is the best classifier found by the grid search
print(grid_search.best_estimator_)
```
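Besides `best_estimator_`, the fitted search object also exposes `best_params_` (the winning parameter combination) and `best_score_` (its mean cross-validated accuracy). A minimal self-contained run, using a deliberately small grid so the sketch finishes quickly:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666)

# Tiny grid for illustration; the full grid above works the same way
param_search = [{"weights": ["uniform"], "n_neighbors": [3, 5]}]
grid_search = GridSearchCV(KNeighborsClassifier(), param_search)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)        # best parameter combination found
print(grid_search.best_score_)         # its mean cross-validated accuracy
best_knn = grid_search.best_estimator_
print(best_knn.score(X_test, y_test))  # accuracy on the held-out test set
```

Note that `best_score_` comes from cross-validation on the training set, so it generally differs from the final held-out test accuracy.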