## 任务说明

Binary classification is one of the most fundamental problem in machine learning. In this tutorial, you are going to build linear binary classifiers to predict whether the income of an indivisual exceeds 50,000 or not. We presented a discriminative and a generative approaches, the logistic regression(LR) and the linear discriminant anaysis(LDA). You are encouraged to compare the differences between the two, or explore more methodologies.

## 数据说明

#### X_train与X_test格式相同，利用jupter notebook打开X_train

```import numpy as np
import pandas as pd
np.random.seed(0)
X_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_train'
Y_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/Y_train'
X_test_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_test'
output_fpath = './output_{}.csv'
data```

#### 打开Y_train文件

```target = pd.read_csv(Y_train_fpath,index_col=0)
target```

## 预处理

### 将数据转化为array

```import numpy as np
np.random.seed(0)
X_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_train'
Y_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/Y_train'
X_test_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_test'
output_fpath = './output_{}.csv'
# Parse csv files to numpy array
with open(X_train_fpath) as f:
next(f)                    #next()读取下一行
X_train = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
next(f)
Y_train = np.array([line.strip('
').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
next(f)
X_test = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
print(X_train)
print(Y_train)
print(X_test)```

```out:
[[33.  1.  0. ... 52.  0.  1.]
[63.  1.  0. ... 52.  0.  1.]
[71.  0.  0. ...  0.  0.  1.]
...
[16.  0.  0. ...  8.  1.  0.]
[48.  1.  0. ... 52.  0.  1.]
[48.  0.  0. ...  0.  0.  1.]]
[1. 0. 0. ... 0. 0. 0.]
[[37.  1.  0. ... 52.  0.  1.]
[48.  1.  0. ... 52.  0.  1.]
[68.  0.  0. ...  0.  1.  0.]
...
[38.  1.  0. ... 52.  0.  1.]
[17.  0.  0. ... 40.  1.  0.]
[22.  0.  0. ... 25.  1.  0.]]```

### 标准化

X：是指需要处理的数据
train：布尔变量，True表示训练集，False表示测试集
specified_column：定义了需要被标准化的列。如果输入为None，则表示所有列都需要被标准化。
X_mean：训练集中每一列的均值。
X_std：训练集中每一列的方差。

```def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
if specified_column == None:
specified_column = np.arange(X.shape[1])
if train:
X_mean = np.mean(X[:, specified_column] ,0).reshape(1, -1)
X_std  = np.std(X[:, specified_column], 0).reshape(1, -1)
X[:,specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8) #1e-8防止除零
return X, X_mean, X_std
# 标准化训练数据和测试数据
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _= _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)
# 用 _ 这个变量来存储函数返回的无用值```

```out:
[[-0.42755297  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
1.01485523]
[ 1.19978055  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
1.01485523]
[ 1.63373616 -1.00040556 -0.1822401  ... -1.4553617  -1.01485523
1.01485523]
...
[-1.34970863 -1.00040556 -0.1822401  ... -1.10738915  0.9853622
-0.9853622 ]
[ 0.38611379  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
1.01485523]
[ 0.38611379 -1.00040556 -0.1822401  ... -1.4553617  -1.01485523
1.01485523]]
[[-0.21057517  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
1.01485523]
[ 0.38611379  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
1.01485523]
[ 1.47100281 -1.00040556 -0.1822401  ... -1.4553617   0.9853622
-0.9853622 ]
...
[-0.15633072  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
1.01485523]
[-1.29546418 -1.00040556 -0.1822401  ...  0.28450104  0.9853622
-0.9853622 ]
[-1.02424193 -1.00040556 -0.1822401  ... -0.36794749  0.9853622
-0.9853622 ]]```

### 分割测试集与验证集

```def _train_dev_split(X, Y, dev_ratio = 0.25):
# This function spilts data into training set and development set.
train_size = int(len(X) * (1 - dev_ratio))
return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]

# 把数据分成训练集和验证集
# 这里的Development set即为验证集
dev_ratio = 0.1
X_train, Y_train, X_dev, Y_dev = _train_dev_split(X_train, Y_train, dev_ratio = dev_ratio)
train_size = X_train.shape[0]    #训练集
dev_size = X_dev.shape[0]     #验证集
test_size = X_test.shape[0]    #测试集
data_dim = X_train.shape[1]
print('Size of training set: {}'.format(train_size))
print('Size of development set: {}'.format(dev_size))
print('Size of testing set: {}'.format(test_size))
print('Dimension of data: {}'.format(data_dim))```

```out:
Size of training set: 48830
Size of development set: 5426
Size of testing set: 27622
Dimension of data: 510```

```def _shuffle(X, Y):
# This function shuffles two equal-length list/array, X and Y, together.
randomize = np.arange(len(X))
np.random.shuffle(randomize)
return (X[randomize], Y[randomize])
def _sigmoid(z):
# Sigmoid function can be used to calculate probability.
# To avoid overflow, minimum/maximum output value is set.
return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))
# 该函数的作用是将数组a中的所有数限定到范围1e-8和1 - (1e-8)之中。
# 部分参数解释：
# a_min：被限定的最小值，所有比 1e-8 小的数都会强制变为 1e-8；
# a_max：被限定的最大值，所有比 1 - (1e-8) 大的数都会强制变为1 - (1e-8)；
def _f(X, w, b):
# This is the logistic regression function, parameterized by w and b
# Arguements:
#     X: input data, shape = [batch_size, data_dimension]
#     w: weight vector, shape = [data_dimension, ]
#     b: bias, scalar
# Output:
#     predicted probability of each row of X being positively labeled, shape = [batch_size, ]
return _sigmoid(np.matmul(X, w) + b)
def _predict(X, w, b):
# This function returns a truth value prediction for each row of X
# by rounding the result of logistic regression function.
return np.round(_f(X, w, b)).astype(np.int)

def _accuracy(Y_pred, Y_label):
# This function calculates prediction accuracy
acc = 1 - np.mean(np.abs(Y_pred - Y_label))
return acc```

## Logistic Regression

### 损失函数（交叉熵的求和）和梯度

```def _cross_entropy_loss(y_pred, Y_label):
# This function computes the cross entropy.
#
# Arguements:
#     y_pred: probabilistic predictions, float vector
#     Y_label: ground truth labels, bool vector
# Output:
#     cross entropy, scalar
cross_entropy = -np.dot(Y_label, np.log(y_pred)) - np.dot((1 - Y_label), np.log(1 - y_pred))
return cross_entropy```

```def _gradient(X, Y_label, w, b):
# This function computes the gradient of cross entropy loss with respect to weight w and bias b.
y_pred = _f(X, w, b)
pred_error = Y_label - y_pred
w_grad = -np.sum(pred_error * X.T, 1)

### Training

mini-batch SGD：就是选着合适Batch Size的SGD算法，mini-batch利用噪声梯度，一定程度上缓解了GD算法直接掉进初始点附近的局部最优值。同时梯度准确了，学习率要加大。

```# 初始化权重w和b，令它们都为0
w = np.zeros((data_dim,))#[0,0,0,...,0]
b = np.zeros((1,)) #[0]
# 训练时的超参数
max_iter = 10
batch_size = 8
learning_rate = 0.2
# 保存每个iteration的loss和accuracy，以便后续画图
train_loss = []
dev_loss = []
train_acc = []
dev_acc = []
# 累计参数更新的次数
step = 1
# 迭代训练
for epoch in range(max_iter):
# 在每个epoch开始时，随机打散训练数据
X_train, Y_train = _shuffle(X_train, Y_train)

# Mini-batch训练
for idx in range(int(np.floor(train_size / batch_size))):
X = X_train[idx*batch_size:(idx+1)*batch_size]
Y = Y_train[idx*batch_size:(idx+1)*batch_size]
# 计算梯度

# 梯度下降法更新
# 学习率随时间衰减
w = w - learning_rate/np.sqrt(step) * w_grad
b = b - learning_rate/np.sqrt(step) * b_grad
step = step + 1

# 计算训练集和验证集的loss和accuracy
y_train_pred = _f(X_train, w, b)
Y_train_pred = np.round(y_train_pred)
train_acc.append(_accuracy(Y_train_pred, Y_train))
train_loss.append(_cross_entropy_loss(y_train_pred, Y_train) / train_size)
y_dev_pred = _f(X_dev, w, b)
Y_dev_pred = np.round(y_dev_pred)
dev_acc.append(_accuracy(Y_dev_pred, Y_dev))
dev_loss.append(_cross_entropy_loss(y_dev_pred, Y_dev) / dev_size)
#输出最后一个值
print('Training loss: {}'.format(train_loss[-1]))
print('Development loss: {}'.format(dev_loss[-1]))
print('Training accuracy: {}'.format(train_acc[-1]))
print('Development accuracy: {}'.format(dev_acc[-1]))```

```Training loss: 0.27375098820698607
Development loss: 0.29846019916163835
Training accuracy: 0.8825107515871391
Development accuracy: 0.877441946185035```

### 画出loss和accuracy的曲线

```import matplotlib.pyplot as plt
# Loss curve
plt.plot(train_loss)
plt.plot(dev_loss)
plt.title('Loss')
plt.legend(['train', 'dev'])
plt.savefig('loss.png')
plt.show()
# Accuracy curve
plt.plot(train_acc)
plt.plot(dev_acc)
plt.title('Accuracy')
plt.legend(['train', 'dev'])
plt.savefig('acc.png')
plt.show()```

### 测试

```output_fpath = './output_{}.csv'
# Predict testing labels
predictions = _predict(X_test, w, b)
with open(output_fpath.format('logistic'), 'w') as f:
f.write('id,label
')
for i, label in  enumerate(predictions):
f.write('{},{}
'.format(i, label))
# Print out the most significant weights
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
').split(',')
features = np.array(content)
for i in ind[0:10]:
print(features[i], w[i])```

```Other Rel <18 never married RP of subfamily -1.5156535032617535
Other Rel <18 ever marr RP of subfamily -1.2493025752946474
Unemployed full-time 1.1489343960724647
1 0.8323252735693378
Italy -0.7951922604515268
Neither parent present -0.7749673709650178
Kentucky -0.7717486769177805
num persons worked for employer 0.7617890642364086
Householder -0.753455652297259
dividends from stocks -0.6728525747897033```

## 概率生成模型（Porbabilistic generative model）

### 数据预处理

```# Parse csv files to numpy array
with open(X_train_fpath) as f:
next(f)
X_train = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
next(f)
Y_train = np.array([line.strip('
').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
next(f)
X_test = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
# Normalize training and testing data
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _= _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)```

### 均值和协方差矩阵

```# 分别计算类别0和类别1的均值
X_train_0 = np.array([x for x, y in zip(X_train, Y_train) if y == 0])
X_train_1 = np.array([x for x, y in zip(X_train, Y_train) if y == 1])
mean_0 = np.mean(X_train_0, axis = 0)
mean_1 = np.mean(X_train_1, axis = 0)
# 分别计算类别0和类别1的协方差
cov_0 = np.zeros((data_dim, data_dim))
cov_1 = np.zeros((data_dim, data_dim))
for x in X_train_0:
cov_0 += np.dot(np.transpose([x - mean_0]), [x - mean_0]) / X_train_0.shape[0]
for x in X_train_1:
cov_1 += np.dot(np.transpose([x - mean_1]), [x - mean_1]) / X_train_1.shape[0]
# 共享协方差 = 独立的协方差的加权求和
cov = (cov_0 * X_train_0.shape[0] + cov_1 * X_train_1.shape[0]) / (X_train_0.shape[0] + X_train_1.shape[0])```

### 计算权重和偏差

```# 计算协方差矩阵的逆
# 协方差矩阵可能是奇异矩阵, 直接使用np.linalg.inv() 可能会产生错误
# 通过SVD矩阵分解，可以快速准确地获得方差矩阵的逆
u, s, v = np.linalg.svd(cov, full_matrices=False)
inv = np.matmul(v.T * 1 / s, u.T)
# 计算w和b
w = np.dot(inv, mean_0 - mean_1)
b =  (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1))\
+ np.log(float(X_train_0.shape[0]) / X_train_1.shape[0])
# 计算训练集上的准确率
Y_train_pred = 1 - _predict(X_train, w, b)
#这边别和逻辑回归弄混了_predict(X_train, w, b)算出来是属于第0类的概率
print('Training accuracy: {}'.format(_accuracy(Y_train_pred, Y_train)))```

`Training accuracy: 0.8719404305514598`

### 预测：

```# Predict testing labels
predictions = 1 - _predict(X_test, w, b)
with open(output_fpath.format('generative'), 'w') as f:
f.write('id,label
')
for i, label in  enumerate(predictions):
f.write('{},{}
'.format(i, label))
# Print out the most significant weights
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
').split(',')
features = np.array(content)
for i in ind[0:10]:
print(features[i], w[i])```

```Professional specialty -7.375
7 6.8125
29 6.7109375
MSA to nonMSA -6.5
Finance insurance and real estate -6.3125
Different state same division 6.078125
Sales -5.15625
34 -5.041015625```

## 模型修改

### 引入二次项

```def _add_feature(X):
X_2 = np.power(X,2)
X = np.concatenate([X,X_2], axis=1)
return X
# 引入二次项

```# adagrad所需的累加和
eps = 1e-8
# 迭代训练
for epoch in range(max_iter):
# 在每个epoch开始时，随机打散训练数据
X_train, Y_train = _shuffle(X_train, Y_train)
# Mini-batch训练
for idx in range(int(np.floor(train_size / batch_size))):
X = X_train[idx * batch_size:(idx + 1) * batch_size]
Y = Y_train[idx * batch_size:(idx + 1) * batch_size]
# 计算梯度