Press "Enter" to skip to content

李宏毅深度学习作业二

本站内容均来自兴趣收集,如不慎侵害的您的相关权益,请留言告知,我们将尽快删除.谢谢.

任务说明

 

Binary classification is one of the most fundamental problem in machine learning. In this tutorial, you are going to build linear binary classifiers to predict whether the income of an indivisual exceeds 50,000 or not. We presented a discriminative and a generative approaches, the logistic regression(LR) and the linear discriminant anaysis(LDA). You are encouraged to compare the differences between the two, or explore more methodologies.

 

总结:在本次作业中,需要写一个 线性二元分类器 ,根据人们的个人信息,判断其年收入是否高于 50,000 美元。作业将用 logistic regression 与 generative model 两种模型来实现目标,并且鼓励比较两者异同点

 

数据说明

 

这个资料集是由 UCI Machine Learning Repository 的 Census-Income (KDD) Data Set) 经过一些处理而得来。

 

事实上在训练过程中,只有 X_train、Y_train 和 X_test 这三个经过处理的档案会被使用到,train.csv 和 test.csv 这两个原始资料档则可以提供你一些额外的资讯。

 

原数据经过了如下处理

移除了一些不必要的数据
对离散值进行了one-hot编码
稍微平衡了正负标记的数据的比例的处理

X_train与X_test格式相同,利用jupter notebook打开X_train

 

import numpy as np
import pandas as pd
np.random.seed(0)
X_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_train'
Y_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/Y_train'
X_test_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_test'
output_fpath = './output_{}.csv'
data = pd.read_csv(X_train_fpath,index_col=0)
data

 

结果:

 

 

第一行是表头,从第二行开始是具体数据,表头是人们的个人信息,比如年龄、性别、学历、婚姻状况、孩子个数等等。

 

打开Y_train文件

 

target = pd.read_csv(Y_train_fpath,index_col=0)
target

 

结果:

 

 

只有两列,第一列是人的编号(ID),第二列是一个label——如果年收入>50K美元,label就是1;如果年收入≤50K美元,label就是0.

 

任务目标

 

输入:人们的个人信息

 

输出:0(年收入≤50K)或1(年收入>50K)

 

模型:logistic regression 或者 generative model

 

任务解答

 

预处理

 

将数据转化为array

 

import numpy as np
np.random.seed(0)
X_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_train'
Y_train_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/Y_train'
X_test_fpath = '/Users/zhucan/Desktop/李宏毅深度学习作业/第二次作业/X_test'
output_fpath = './output_{}.csv'
# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)                    #next()读取下一行
    X_train = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('
').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
print(X_train)
print(Y_train)
print(X_test)

 

out:
[[33.  1.  0. ... 52.  0.  1.]
 [63.  1.  0. ... 52.  0.  1.]
 [71.  0.  0. ...  0.  0.  1.]
 ...
 [16.  0.  0. ...  8.  1.  0.]
 [48.  1.  0. ... 52.  0.  1.]
 [48.  0.  0. ...  0.  0.  1.]]
[1. 0. 0. ... 0. 0. 0.]
[[37.  1.  0. ... 52.  0.  1.]
 [48.  1.  0. ... 52.  0.  1.]
 [68.  0.  0. ...  0.  1.  0.]
 ...
 [38.  1.  0. ... 52.  0.  1.]
 [17.  0.  0. ... 40.  1.  0.]
 [22.  0.  0. ... 25.  1.  0.]]

 

标准化

 

定义一个标准化函数_normalize():

X:是指需要处理的数据
train:布尔变量,True表示训练集,False表示测试集
specified_column:定义了需要被标准化的列。如果输入为None,则表示所有列都需要被标准化。
X_mean:训练集中每一列的均值。
X_std:训练集中每一列的方差。

然后,对X_train和X_test分别调用该函数,完成标准化。

 

def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    if specified_column == None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column] ,0).reshape(1, -1)
        X_std  = np.std(X[:, specified_column], 0).reshape(1, -1)
    X[:,specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8) #1e-8防止除零
    return X, X_mean, X_std
# 标准化训练数据和测试数据
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _= _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)
# 用 _ 这个变量来存储函数返回的无用值

 

out:
[[-0.42755297  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
   1.01485523]
 [ 1.19978055  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
   1.01485523]
 [ 1.63373616 -1.00040556 -0.1822401  ... -1.4553617  -1.01485523
   1.01485523]
 ...
 [-1.34970863 -1.00040556 -0.1822401  ... -1.10738915  0.9853622
  -0.9853622 ]
 [ 0.38611379  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
   1.01485523]
 [ 0.38611379 -1.00040556 -0.1822401  ... -1.4553617  -1.01485523
   1.01485523]]
[[-0.21057517  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
   1.01485523]
 [ 0.38611379  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
   1.01485523]
 [ 1.47100281 -1.00040556 -0.1822401  ... -1.4553617   0.9853622
  -0.9853622 ]
 ...
 [-0.15633072  0.99959459 -0.1822401  ...  0.80645986 -1.01485523
   1.01485523]
 [-1.29546418 -1.00040556 -0.1822401  ...  0.28450104  0.9853622
  -0.9853622 ]
 [-1.02424193 -1.00040556 -0.1822401  ... -0.36794749  0.9853622
  -0.9853622 ]]

 

分割测试集与验证集

 

对原来的X_train进行分割,分割比例为train:dev = 9:1。这里没有shuffle,是固定分割

 

def _train_dev_split(X, Y, dev_ratio = 0.25):
    # This function spilts data into training set and development set.
    train_size = int(len(X) * (1 - dev_ratio))
    return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]
 
# 把数据分成训练集和验证集
# 这里的Development set即为验证集
dev_ratio = 0.1
X_train, Y_train, X_dev, Y_dev = _train_dev_split(X_train, Y_train, dev_ratio = dev_ratio)
train_size = X_train.shape[0]    #训练集
dev_size = X_dev.shape[0]     #验证集
test_size = X_test.shape[0]    #测试集
data_dim = X_train.shape[1]
print('Size of training set: {}'.format(train_size))
print('Size of development set: {}'.format(dev_size))
print('Size of testing set: {}'.format(test_size))
print('Dimension of data: {}'.format(data_dim))

 

out:
Size of training set: 48830
Size of development set: 5426
Size of testing set: 27622
Dimension of data: 510

 

这几个函数可能会在训练中被重复使用到

 

def _shuffle(X, Y):
    # This function shuffles two equal-length list/array, X and Y, together.
    randomize = np.arange(len(X))
    np.random.shuffle(randomize)
    return (X[randomize], Y[randomize])
def _sigmoid(z):
    # Sigmoid function can be used to calculate probability.
    # To avoid overflow, minimum/maximum output value is set.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))
    # 该函数的作用是将数组a中的所有数限定到范围1e-8和1 - (1e-8)之中。
    # 部分参数解释:
    # a_min:被限定的最小值,所有比 1e-8 小的数都会强制变为 1e-8;
    # a_max:被限定的最大值,所有比 1 - (1e-8) 大的数都会强制变为1 - (1e-8);
def _f(X, w, b):
    # This is the logistic regression function, parameterized by w and b
    # Arguements:
    #     X: input data, shape = [batch_size, data_dimension]
    #     w: weight vector, shape = [data_dimension, ]
    #     b: bias, scalar
    # Output:
    #     predicted probability of each row of X being positively labeled, shape = [batch_size, ]
    return _sigmoid(np.matmul(X, w) + b)
def _predict(X, w, b):
    # This function returns a truth value prediction for each row of X 
    # by rounding the result of logistic regression function.
    return np.round(_f(X, w, b)).astype(np.int)
    
def _accuracy(Y_pred, Y_label):
    # This function calculates prediction accuracy
    acc = 1 - np.mean(np.abs(Y_pred - Y_label))
    return acc

 

Logistic Regression

 

损失函数(交叉熵的求和)和梯度

 

def _cross_entropy_loss(y_pred, Y_label):
    # This function computes the cross entropy.
    #
    # Arguements:
    #     y_pred: probabilistic predictions, float vector
    #     Y_label: ground truth labels, bool vector
    # Output:
    #     cross entropy, scalar
    cross_entropy = -np.dot(Y_label, np.log(y_pred)) - np.dot((1 - Y_label), np.log(1 - y_pred))
    return cross_entropy

 

def _gradient(X, Y_label, w, b):
    # This function computes the gradient of cross entropy loss with respect to weight w and bias b.
    y_pred = _f(X, w, b)
    pred_error = Y_label - y_pred
    w_grad = -np.sum(pred_error * X.T, 1)
    b_grad = -np.sum(pred_error)
    return w_grad, b_grad

 

Training

 

使用小批次(mini-batch)的梯度下降法来训练。训练数据被分为许多小批次,针对每一个小批次,我们分别计算其梯度以及损失,并根据该批次来更新模型的参数。当一次循环(iteration)完成,也就是整个训练集的所有小批次都被使用过一次以后,我们将所有训练数据打散并且重新分成新的小批次,进行下一个循环,直到事先设定的循环数量(iteration)达成为止。

GD(Gradient Descent):就是没有利用Batch Size,用基于整个数据库得到梯度,梯度准确,但数据量大时,计算非常耗时,同时神经网络常是非凸的,网络最终可能收敛到初始点附近的局部最优点。
SGD(Stochastic Gradient Descent):就是Batch Size=1,每次计算一个样本,梯度不准确,所以学习率要降低。
mini-batch SGD:就是选着合适Batch Size的SGD算法,mini-batch利用噪声梯度,一定程度上缓解了GD算法直接掉进初始点附近的局部最优值。同时梯度准确了,学习率要加大。

# 初始化权重w和b,令它们都为0
w = np.zeros((data_dim,))#[0,0,0,...,0]
b = np.zeros((1,)) #[0]
# 训练时的超参数
max_iter = 10
batch_size = 8
learning_rate = 0.2
# 保存每个iteration的loss和accuracy,以便后续画图
train_loss = []
dev_loss = []
train_acc = []
dev_acc = []
# 累计参数更新的次数
step = 1
# 迭代训练
for epoch in range(max_iter):
    # 在每个epoch开始时,随机打散训练数据
    X_train, Y_train = _shuffle(X_train, Y_train)
        
    # Mini-batch训练
    for idx in range(int(np.floor(train_size / batch_size))):
        X = X_train[idx*batch_size:(idx+1)*batch_size]
        Y = Y_train[idx*batch_size:(idx+1)*batch_size]
        # 计算梯度
        w_grad, b_grad = _gradient(X, Y, w, b)
            
        # 梯度下降法更新
        # 学习率随时间衰减
        w = w - learning_rate/np.sqrt(step) * w_grad
        b = b - learning_rate/np.sqrt(step) * b_grad
        step = step + 1
            
    # 计算训练集和验证集的loss和accuracy
    y_train_pred = _f(X_train, w, b)
    Y_train_pred = np.round(y_train_pred)
    train_acc.append(_accuracy(Y_train_pred, Y_train))
    train_loss.append(_cross_entropy_loss(y_train_pred, Y_train) / train_size)
    y_dev_pred = _f(X_dev, w, b)
    Y_dev_pred = np.round(y_dev_pred)
    dev_acc.append(_accuracy(Y_dev_pred, Y_dev))
    dev_loss.append(_cross_entropy_loss(y_dev_pred, Y_dev) / dev_size)
#输出最后一个值
print('Training loss: {}'.format(train_loss[-1]))  
print('Development loss: {}'.format(dev_loss[-1]))
print('Training accuracy: {}'.format(train_acc[-1]))
print('Development accuracy: {}'.format(dev_acc[-1]))

 

结果:

 

Training loss: 0.27375098820698607
Development loss: 0.29846019916163835
Training accuracy: 0.8825107515871391
Development accuracy: 0.877441946185035

 

画出loss和accuracy的曲线

 

import matplotlib.pyplot as plt
# Loss curve
plt.plot(train_loss)
plt.plot(dev_loss)
plt.title('Loss')
plt.legend(['train', 'dev'])
plt.savefig('loss.png')
plt.show()
# Accuracy curve
plt.plot(train_acc)
plt.plot(dev_acc)
plt.title('Accuracy')
plt.legend(['train', 'dev'])
plt.savefig('acc.png')
plt.show()

 

结果:

 

 

 

测试

 

 

output_fpath = './output_{}.csv'
# Predict testing labels
predictions = _predict(X_test, w, b)
with open(output_fpath.format('logistic'), 'w') as f:
    f.write('id,label
')
    for i, label in  enumerate(predictions):
        f.write('{},{}
'.format(i, label))
# Print out the most significant weights
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
    content = f.readline().strip('
').split(',')
features = np.array(content)
for i in ind[0:10]:
    print(features[i], w[i])

 

结果:

 

Other Rel <18 never married RP of subfamily -1.5156535032617535
 Other Rel <18 ever marr RP of subfamily -1.2493025752946474
 Unemployed full-time 1.1489343960724647
 1 0.8323252735693378
 Italy -0.7951922604515268
 Neither parent present -0.7749673709650178
 Kentucky -0.7717486769177805
num persons worked for employer 0.7617890642364086
 Householder -0.753455652297259
dividends from stocks -0.6728525747897033

 

概率生成模型(Porbabilistic generative model)

 

训练集与测试集的处理方法跟 logistic regression 一模一样,然而因为 generative model 有可解析的最佳解,因此不必使用到验证集(development set)

 

数据预处理

 

# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('
').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('
').split(',')[1:] for line in f], dtype = float)
# Normalize training and testing data
X_train, X_mean, X_std = _normalize(X_train, train = True)
X_test, _, _= _normalize(X_test, train = False, specified_column = None, X_mean = X_mean, X_std = X_std)

 

均值和协方差矩阵

 

 

# 分别计算类别0和类别1的均值
X_train_0 = np.array([x for x, y in zip(X_train, Y_train) if y == 0]) 
X_train_1 = np.array([x for x, y in zip(X_train, Y_train) if y == 1])
mean_0 = np.mean(X_train_0, axis = 0)
mean_1 = np.mean(X_train_1, axis = 0)  
# 分别计算类别0和类别1的协方差
cov_0 = np.zeros((data_dim, data_dim))
cov_1 = np.zeros((data_dim, data_dim))
for x in X_train_0:
    cov_0 += np.dot(np.transpose([x - mean_0]), [x - mean_0]) / X_train_0.shape[0]
for x in X_train_1:
    cov_1 += np.dot(np.transpose([x - mean_1]), [x - mean_1]) / X_train_1.shape[0]
# 共享协方差 = 独立的协方差的加权求和
cov = (cov_0 * X_train_0.shape[0] + cov_1 * X_train_1.shape[0]) / (X_train_0.shape[0] + X_train_1.shape[0])

 

计算权重和偏差

 

权重矩阵与偏差向量可以直接被计算出来,详情可见视频P10 Classification:

 

 

# 计算协方差矩阵的逆
# 协方差矩阵可能是奇异矩阵, 直接使用np.linalg.inv() 可能会产生错误
# 通过SVD矩阵分解,可以快速准确地获得方差矩阵的逆
u, s, v = np.linalg.svd(cov, full_matrices=False)
inv = np.matmul(v.T * 1 / s, u.T)
# 计算w和b
w = np.dot(inv, mean_0 - mean_1)
b =  (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1))\
    + np.log(float(X_train_0.shape[0]) / X_train_1.shape[0]) 
# 计算训练集上的准确率
Y_train_pred = 1 - _predict(X_train, w, b)
#这边别和逻辑回归弄混了_predict(X_train, w, b)算出来是属于第0类的概率
print('Training accuracy: {}'.format(_accuracy(Y_train_pred, Y_train)))

 

结果:

 

Training accuracy: 0.8719404305514598

 

预测:

 

# Predict testing labels
predictions = 1 - _predict(X_test, w, b)
with open(output_fpath.format('generative'), 'w') as f:
    f.write('id,label
')
    for i, label in  enumerate(predictions):
        f.write('{},{}
'.format(i, label))
# Print out the most significant weights
ind = np.argsort(np.abs(w))[::-1]
with open(X_test_fpath) as f:
    content = f.readline().strip('
').split(',')
features = np.array(content)
for i in ind[0:10]:
    print(features[i], w[i])

 

结果:

 

Professional specialty -7.375
 7 6.8125
 Retail trade 6.76953125
 29 6.7109375
 MSA to nonMSA -6.5
 Finance insurance and real estate -6.3125
 Different state same division 6.078125
 Abroad -6.0
 Sales -5.15625
 34 -5.041015625

 

模型修改

 

引入二次项

 

def _add_feature(X):
    X_2 = np.power(X,2)
    X = np.concatenate([X,X_2], axis=1)
    return X
# 引入二次项
X_train = _add_feature(X_train)
X_test = _add_feature(X_test)

 

adagrad

 

# adagrad所需的累加和
adagrad_w = 0
adagrad_b = 0
# 防止adagrad除零
eps = 1e-8
# 迭代训练
for epoch in range(max_iter):
    # 在每个epoch开始时,随机打散训练数据
    X_train, Y_train = _shuffle(X_train, Y_train)
    # Mini-batch训练
    for idx in range(int(np.floor(train_size / batch_size))):
        X = X_train[idx * batch_size:(idx + 1) * batch_size]
        Y = Y_train[idx * batch_size:(idx + 1) * batch_size]
        # 计算梯度
        w_grad, b_grad = _gradient(X, Y, w, b)
        adagrad_w += w_grad**2
        adagrad_b += b_grad**2
        # 梯度下降法adagrad更新w和b
        w = w - learning_rate / (np.sqrt(adagrad_w + eps)) * w_grad
        b = b - learning_rate / (np.sqrt(adagrad_b + eps)) * b_grad

Be First to Comment

发表评论

您的电子邮箱地址不会被公开。 必填项已用*标注