# Notes on Naive Bayes (统计学习方法)

## 4 Applications of Naive Bayes Algorithms

Real-time Prediction
: Naive Bayes is an eager learning classifier, and it is very fast. It can therefore be used for making predictions in real time.

Multi-class Prediction
: The algorithm is also well known for its multi-class prediction capability: it can predict the probability of each class of the target variable.

Text Classification / Spam Filtering / Sentiment Analysis
: Naive Bayes classifiers are widely used in text classification (owing to their good results on multi-class problems under the independence assumption) and often achieve a higher success rate there than other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (e.g. identifying positive and negative customer sentiment in social media).
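The spam-filtering use case can be sketched with a tiny word-count model. This is a minimal illustration on a made-up corpus (the messages and labels below are invented for the example), not a production filter; `classify` scores each class by its Laplace-smoothed log-probability:

```python
import math
from collections import Counter, defaultdict

# Toy training corpus; labels and texts are made up for illustration.
train = [
    ("spam", "win money now"),
    ("spam", "win a free prize now"),
    ("ham",  "meeting at noon"),
    ("ham",  "lunch meeting tomorrow"),
]

# Count words per class and documents per class.
word_counts = defaultdict(Counter)
class_counts = Counter()
for label, text in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text, alpha=1.0):
    """Return the class maximizing log P(c) + sum_w log P(w|c), Laplace-smoothed."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        log_p = math.log(class_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for w in text.split():
            log_p += math.log((word_counts[label][w] + alpha)
                              / (n_words + alpha * len(vocab)))
        scores[label] = log_p
    return max(scores, key=scores.get)

print(classify("win a prize"))   # "spam" on this toy corpus
print(classify("lunch at noon")) # "ham" on this toy corpus
```

Summing log-probabilities instead of multiplying raw probabilities keeps the score numerically stable as the number of words grows.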

Recommendation Systems
: A Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.

As an example (digit recognition), a simple implementation of the Naive Bayes algorithm (multinomial model) follows:

```python
import numpy as np
from collections import defaultdict

class MultinomialNB:
    '''
    fit parameters:
        X      training data (feature matrix)
        y      training labels
        alpha  smoothing constant of the Bayesian estimate (the positive lambda)
    predict parameters:
        test   a single test sample (feature vector)
    '''
    def fit(self, X, y, alpha=0):
        # Group the samples by class
        feature_data = defaultdict(list)
        label_data = defaultdict(int)
        for feature, lab in zip(X, y):
            feature_data[lab].append(feature)
            label_data[lab] += 1
        # Smoothed prior probabilities P(Y = c)
        self.label = y
        n_classes = len(np.unique(self.label))
        self.pri_p_label = {k: (v + alpha) / (len(self.label) + n_classes * alpha)
                            for k, v in label_data.items()}
        # Smoothed conditional probabilities P(X_j = v | Y = c)
        self.cond_p_feature = defaultdict(dict)
        for i, sub in feature_data.items():
            sub = np.array(sub)
            for f_dim in range(sub.shape[1]):
                for feature in np.unique(X[:, f_dim]):
                    self.cond_p_feature[i][(f_dim, feature)] = (
                        (np.sum(sub[:, f_dim] == feature) + alpha)
                        / (sub.shape[0] + len(np.unique(X[:, f_dim])) * alpha))

    def predict(self, test):
        p_data = {}
        for sub_label in np.unique(self.label):
            # Sum log-probabilities to avoid floating-point underflow
            p_data[sub_label] = np.log(self.pri_p_label[sub_label])
            for i in range(len(test)):
                cond = self.cond_p_feature[sub_label].get((i, test[i]))
                if cond:  # with alpha > 0 every value seen in training is present
                    p_data[sub_label] += np.log(cond)
        opt_label = max(p_data, key=p_data.get)
        # Return the predicted label and its (unnormalized) log-probability
        return [opt_label, p_data.get(opt_label)]
```

```python
import numpy as np
from sklearn.model_selection import train_test_split

# `dataset` is assumed to have been loaded earlier (e.g. the digit data),
# with the label in column 0 and the pixel features in the remaining columns.
dataset = np.array(dataset)
# Binarize the pixel features
dataset[:, 1:][dataset[:, 1:] != 0] = 1
label = dataset[:, 0]
features = dataset[:, 1:]
# Split into training and test sets (features only, so the label column
# does not leak into the model)
train_dat, test_dat, train_label, test_label = train_test_split(
    features, label, test_size=0.2, random_state=123456)
# Build the NB model
model = MultinomialNB()
model.fit(X=train_dat, y=train_label, alpha=1)
# Predict with the NB model
pl = {}
for i, test in enumerate(test_dat):
    pl[i] = model.predict(test=test)
# Print the test error rate (%)
error = 0
for k, v in pl.items():
    if test_label[k] != v[0]:
        error += 1
print(error / len(test_label) * 100)
```
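As a cross-check, scikit-learn (already used above for `train_test_split`) ships `CategoricalNB`, which applies the same per-value smoothing to the conditional probabilities; note it does not smooth the class prior, so results can differ slightly from the class above. A minimal sketch on made-up binary features:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Made-up binarized features standing in for the digit data above
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

clf = CategoricalNB(alpha=1.0)  # alpha plays the role of lambda above
clf.fit(X, y)
print(clf.predict(np.array([[1, 0, 1]])))  # [1]: the sample matches a class-1 row
```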