To demonstrate the classification algorithms we will use the classic iris dataset. We first load the dataset, assuming here that the usual preprocessing (cleaning, deduplication, imputation of missing values) and normalization have already been done.
We then split the dataset into a training set and a test set so we can check how well each model generalizes. Note that a single hold-out split like this is not cross-validation proper; a k-fold version is sketched after the code below.
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> data = load_iris()
>>> x = data['data']
>>> y = data['target']
>>> x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True)  # no random_state, so the split (and every score below) changes between runs
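For reference, here is a minimal k-fold cross-validation sketch using the x and y loaded above (the 5 folds and the choice of kNN as the example model are assumptions for illustration):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# score the model on 5 train/validation folds instead of one hold-out split
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), x, y, cv=5)
print(scores.mean(), scores.std())  # average accuracy and its spread across folds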
First we try the kNN algorithm:
>>> from sklearn.neighbors import KNeighborsClassifier
>>> k = 3
>>> clf = KNeighborsClassifier(n_neighbors=k)
>>> clf.fit(x_train, y_train)
Then we check its accuracy:
>>> clf.score(x_train, y_train)
0.9910714285714286
>>> clf.score(x_test, y_test)
0.8947368421052632
We take one sample and inspect it:
>>> x_sample = x[0]
>>> x_sample
array([5.1, 3.5, 1.4, 0.2])
>>> y_sample = y[0]
>>> y_sample
0
>>> y_pred = clf.predict(x_sample.reshape(-1, 4))  # same 2-D shape as the training input
>>> y_pred
array([0])
>>> neighbors = clf.kneighbors(x_sample.reshape(-1, 4), return_distance=False)
>>> neighbors
array([[ 70, 106, 40]])
>>> y[neighbors]  # bug: these indices refer to rows of x_train, so they must be looked up in y_train
array([[1, 2, 0]])
We passed in a single sample and the prediction is correct. The neighbor labels, however, look wrong: kneighbors returns row indices into the training set the model was fitted on, so indexing the full y with them (as above) yields meaningless labels; the correct lookup is y_train[neighbors]. Separately, the roughly 10% gap between the training and test scores suggests the model could generalize better; sweeping over k is the usual remedy, as sketched below.
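A minimal sketch of choosing n_neighbors by cross-validation (the range 1–15 and the 5 folds are arbitrary assumptions, not tuned choices):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# try several values of k and keep the one with the best mean CV accuracy
best_k, best_score = None, 0.0
for k in range(1, 16):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), x_train, y_train, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)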
Next we look at how a decision tree does:
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(min_samples_split=3)
>>> clf.fit(x_train, y_train)
>>> clf.score(x_train, y_train)
1.0
>>> clf.score(x_test, y_test)
0.8947368421052632
>>> x_sample = x[70]
>>> y_sample = y[70]
>>> y_sample
1
>>> clf.predict(x_sample.reshape(-1, 4))
array([2])
The test score is still on the low side while the training score is a perfect 1.0, which points to overfitting. For a decision tree the fix is pruning, i.e. restricting how deep the tree is allowed to grow, as sketched below.
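A minimal pruning sketch (the max_depth=3 value is an assumption chosen for illustration, not a tuned choice):

from sklearn.tree import DecisionTreeClassifier

# cap the depth so the tree cannot simply memorize the training set
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=3)
clf.fit(x_train, y_train)
print(clf.score(x_train, y_train), clf.score(x_test, y_test))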
Now let's try a random forest:
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.metrics import classification_report, accuracy_score
>>> estimator = RandomForestClassifier(n_estimators=100)
>>> estimator.fit(x_train, y_train)
>>> x_pred = estimator.predict(x_train)
>>> x_score = accuracy_score(y_train, x_pred)  # accuracy for classification, same value as the score method
>>> x_score
1.0
>>> test_pred = estimator.predict(x_test)
>>> test_score = accuracy_score(y_test, test_pred)
>>> test_score
0.8947368421052632
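One thing a fitted forest gives us for free is a per-feature importance score; a minimal sketch using the estimator and data loaded above:

# impurity-based importance of each feature, averaged across the trees
for name, importance in zip(data['feature_names'], estimator.feature_importances_):
    print(name, round(importance, 3))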
Next we check the classification results of a support vector machine:
>>> from sklearn.svm import SVC
>>> clf = SVC(C=1.0, kernel='linear')  # the regularization parameter is an uppercase C
>>> clf.fit(x_train, y_train)  # the model must be fitted before scoring
>>> clf.score(x_test, y_test)
0.9473684210526315
The support vector machine does better on the test set than the previous approaches.
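If we wanted to tune C and the kernel rather than guessing, a minimal GridSearchCV sketch would look like this (the parameter grid and the 5 folds are arbitrary assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# score every parameter combination with 5-fold cross-validation
grid = GridSearchCV(SVC(), {'C': [0.1, 1.0, 10.0], 'kernel': ['linear', 'rbf']}, cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)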
Let's see how naive Bayes performs:
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB(alpha=0.0001)
>>> clf.fit(x_train, y_train)
>>> clf.score(x_train, y_train)
0.8214285714285714
>>> clf.score(x_test, y_test)
0.7105263157894737
The multinomial naive Bayes classifier turns out to be rather mediocre here, which is not surprising: MultinomialNB models count-like features, while the iris measurements are continuous. A Gaussian model is a better fit for this data:
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(x_train, y_train)
>>> clf.score(x_train, y_train)
0.9642857142857143
>>> clf.score(x_test, y_test)
0.8947368421052632
>>> pred = clf.predict(x_test)
>>> print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.88      0.88      0.88        16
           2       0.83      0.83      0.83        12

   micro avg       0.89      0.89      0.89        38
   macro avg       0.90      0.90      0.90        38
weighted avg       0.89      0.89      0.89        38
The accuracy improves noticeably over the multinomial variant, matching what kNN, the decision tree and the random forest achieved on this split.
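To wrap up, here is a minimal sketch that scores all of the classifiers above under the same 5-fold cross-validation, which gives a fairer comparison than the single random split used throughout (the fold count is an assumption):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

models = {
    'kNN': KNeighborsClassifier(n_neighbors=3),
    'decision tree': DecisionTreeClassifier(min_samples_split=3),
    'random forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC(C=1.0, kernel='linear'),
    'Gaussian NB': GaussianNB(),
}
for name, model in models.items():
    # mean accuracy over 5 folds on the full dataset
    print(name, cross_val_score(model, x, y, cv=5).mean())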