
## The Idea Behind PCA

PCA looks for the unit vector w onto which the (demeaned) data has maximum projected variance; here we find that direction by gradient ascent.

```
import numpy as np
import matplotlib.pyplot as plt

# Build a 2-D dataset with a clear linear trend.
X = np.empty((100, 2))
X[:, 0] = np.random.uniform(0., 100., size=100)
X[:, 1] = 0.75 * X[:, 0] + 3. + np.random.normal(0., 10., size=100)
plt.scatter(X[:, 0], X[:, 1])
plt.show()

def demean(X):
    # Do not standardize with StandardScaler here: PCA relies on the variance,
    # so we only subtract the mean and leave the scale untouched.
    # axis=0 averages over the rows, yielding one mean per column.
    return X - np.mean(X, axis=0)

x_demean = demean(X)
plt.scatter(x_demean[:, 0], x_demean[:, 1])
plt.show()

# Both column means should now be (numerically) zero.
np.mean(x_demean[:, 0])
np.mean(x_demean[:, 1])

# Objective: variance of the data projected onto the unit vector w.
def f(w, x):
    return np.sum(x.dot(w) ** 2) / len(x)

# Analytic gradient of f.
def df_math(w, x):
    return x.T.dot(x.dot(w)) * 2. / len(x)

# Numerical gradient (central differences), used to debug df_math.
def df_debug(w, x, epsilon=0.0001):
    res = np.empty(len(w))
    for i in range(len(w)):
        w_1 = w.copy()
        w_1[i] += epsilon
        w_2 = w.copy()
        w_2[i] -= epsilon
        res[i] = (f(w_1, x) - f(w_2, x)) / (2 * epsilon)
    return res

def direction(w):
    return w / np.linalg.norm(w)

def gradient_ascent(df, x, init_w, eta, n_iters=1e4, epsilon=1e-8):
    w = direction(init_w)
    cur_iter = 0
    while cur_iter < n_iters:
        gradient = df(w, x)
        last_w = w
        w = w + eta * gradient
        w = direction(w)  # keep w a unit vector
        if abs(f(w, x) - f(last_w, x)) < epsilon:
            break
        cur_iter += 1
    return w
```

```
init_w = np.random.random(X.shape[1])    # must not start from the zero vector
init_w
eta = 0.001
gradient_ascent(df_debug, x_demean, init_w, eta)
```
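A quick sanity check (not in the original post): because f is quadratic in w, the central-difference gradient should match the analytic one essentially exactly. A minimal, self-contained version of that check:

```python
import numpy as np

# Objective: variance of the data projected onto w.
def f(w, x):
    return np.sum(x.dot(w) ** 2) / len(x)

# Analytic gradient of f.
def df_math(w, x):
    return x.T.dot(x.dot(w)) * 2. / len(x)

# Numerical gradient via central differences.
def df_debug(w, x, epsilon=1e-4):
    res = np.empty(len(w))
    for i in range(len(w)):
        w_1, w_2 = w.copy(), w.copy()
        w_1[i] += epsilon
        w_2[i] -= epsilon
        res[i] = (f(w_1, x) - f(w_2, x)) / (2 * epsilon)
    return res

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
x = x - x.mean(axis=0)      # demean
w = rng.random(2)
print(np.allclose(df_math(w, x), df_debug(w, x)))  # True
```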

```
w = gradient_ascent(df_math, x_demean, init_w, eta)
plt.scatter(x_demean[:, 0], x_demean[:, 1])
plt.plot([0, w[0] * 30], [0, w[1] * 30], color='r')
plt.show()
```
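As a cross-check (an addition, not part of the original post), the direction found by gradient ascent should agree, up to sign, with the first component that sklearn's PCA computes. A self-contained sketch on the same kind of synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = np.empty((100, 2))
X[:, 0] = rng.uniform(0., 100., size=100)
X[:, 1] = 0.75 * X[:, 0] + 3. + rng.normal(0., 10., size=100)
x_demean = X - X.mean(axis=0)

def f(w, x):
    return np.sum(x.dot(w) ** 2) / len(x)

def df(w, x):
    return x.T.dot(x.dot(w)) * 2. / len(x)

def direction(w):
    return w / np.linalg.norm(w)

def gradient_ascent(df, x, init_w, eta, n_iters=10000, epsilon=1e-8):
    w = direction(init_w)
    for _ in range(n_iters):
        last_w = w
        w = direction(w + eta * df(w, x))  # step, then renormalize
        if abs(f(w, x) - f(last_w, x)) < epsilon:
            break
    return w

w = gradient_ascent(df, x_demean, rng.random(2), eta=0.001)
w_sklearn = PCA(n_components=1).fit(x_demean).components_[0]
# Unit vectors that agree up to sign have |w . w_sklearn| close to 1.
print(abs(w.dot(w_sklearn)))
```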

## PCA in scikit-learn

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x = digits.data
y = digits.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
x_train.shape
```

Output: (1437, 64)

```
%%time
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train, y_train)
```

Output: Wall time: 288 ms

`knn_clf.score(x_test, y_test)`

Output: 0.9888888888888889

```
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(x_train)
x_train_reduction = pca.transform(x_train)
x_test_reduction = pca.transform(x_test)
```

The `%%time` magic must be the first line of its cell, so the timing runs in a separate cell:

```
%%time
knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train_reduction, y_train)
```

Output: Wall time: 101 ms

`knn_clf.score(x_test_reduction, y_test)`

Output: 0.6055555555555555

`pca.explained_variance_ratio_`

Output: array([0.1450646 , 0.13714246])

```
pca = PCA(n_components=x_train.shape[1])
pca.fit(x_train)
pca.explained_variance_ratio_

# Cumulative explained-variance ratio versus number of components kept.
plt.plot([i for i in range(x_train.shape[1])],
         [np.sum(pca.explained_variance_ratio_[:i + 1]) for i in range(x_train.shape[1])])
plt.show()
```

```
pca = PCA(0.95)   # keep enough components to explain 95% of the variance
pca.fit(x_train)
pca.n_components_
```

Output: 28
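Passing a float to `PCA` amounts to choosing the smallest k whose cumulative explained-variance ratio reaches that threshold. A sketch of the manual equivalent (an illustration, not from the original post):

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, _, _, _ = train_test_split(digits.data, digits.target,
                                    test_size=0.2, random_state=666)

# Full decomposition, then the smallest k reaching 95% cumulative variance.
pca_full = PCA(n_components=x_train.shape[1]).fit(x_train)
cum = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.argmax(cum >= 0.95)) + 1

pca_95 = PCA(0.95).fit(x_train)
print(k, pca_95.n_components_)  # the two numbers should agree
```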

```
x_train_reduction = pca.transform(x_train)
x_test_reduction = pca.transform(x_test)
knn_clf = KNeighborsClassifier()
knn_clf.fit(x_train_reduction, y_train)
knn_clf.score(x_test_reduction, y_test)
```

Output: 0.9833333333333333
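The reduce-then-classify steps above can also be bundled into one estimator so that the test set is transformed automatically; a sketch using sklearn's `Pipeline` (step names are illustrative):

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

# PCA keeping 95% of the variance, followed by a KNN classifier.
pipe = Pipeline([("pca", PCA(0.95)), ("knn", KNeighborsClassifier())])
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))
```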

## Characteristics of PCA

The main advantages of PCA:

1) It measures information purely by variance, so it is unaffected by anything outside the dataset itself.

2) The principal components are mutually orthogonal, which removes interactions between the components of the original data.

3) It is computationally simple: the core operation is an eigendecomposition, which is easy to implement.

The main disadvantages of PCA:

1) The meaning of each principal-component dimension is somewhat ambiguous, so the components are less interpretable than the original features.

2) Low-variance components can still carry important information about differences between samples, so discarding them during dimensionality reduction may hurt downstream processing.
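The second drawback can be made concrete (this sketch is an addition to the post): information in discarded components is gone for good, so reconstructing the data with `inverse_transform` incurs an error that grows as fewer components are kept.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_digits().data   # 1797 samples, 64 features

def reconstruction_error(n_components):
    pca = PCA(n_components=n_components).fit(X)
    # Project down to n_components dimensions, then map back to 64.
    X_rec = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_rec) ** 2)

# The error shrinks as more components are kept; with all 64 it is ~0.
for k in (2, 16, 64):
    print(k, reconstruction_error(k))
```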