Decision Tree

How It Works

Construction

A decision tree is built top-down by recursively choosing an attribute to split the current sample set on. Recursion stops at a node in three cases:

1. All samples at the current node belong to the same class, so no further split is needed.

2. The attribute set is empty, or all samples take identical values on every remaining attribute, so no split is possible; the node is labeled with the majority class.

3. The sample set at the current node is empty, so it cannot be split.

Information Entropy

$$\operatorname{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|} p_{k} \log _{2} p_{k}$$

where $p_k$ is the proportion of class-$k$ samples in dataset $D$. Lower entropy means a purer sample set.
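A quick numeric check of the definition; `entropy` is an illustrative helper, separate from the class listing below:

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

# A 50/50 class split is maximally impure for two classes: exactly 1 bit.
print(entropy(['yes', 'yes', 'no', 'no']))  # → 1.0
```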

Information Gain

$$\operatorname{Gain}(D, a)=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)$$

where $D^v$ is the subset of $D$ whose samples take value $v$ on attribute $a$. ID3 splits on the attribute with the largest gain.
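A minimal sketch of the formula on the toy dataset from the listing at the end (helper names are illustrative):

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(D, a): Ent(D) minus the weighted entropy after splitting on a."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for val, l in zip(values, labels) if val == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

y = ['yes', 'yes', 'no', 'no', 'no']
print(round(info_gain([1, 1, 1, 0, 0], y), 3))  # feature "no surfacing" → 0.42
print(round(info_gain([1, 1, 0, 1, 1], y), 3))  # feature "flippers"     → 0.171
```

The first feature wins, which is exactly the root split the ID3 listing below selects.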

ID3's strength is its simplicity; its weakness is sensitivity to noise: even a small amount of mislabeled training data can lead the constructed tree to misclassify.

Information Gain Ratio

$$\operatorname{GainRatio}(D, a)=\frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)}, \qquad \operatorname{IV}(a)=-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|}$$

where $\operatorname{IV}(a)$ is the intrinsic value of attribute $a$, which grows with the number of distinct values $a$ takes.

C4.5 builds on ID3 by replacing information gain with the gain ratio, which counters information gain's bias toward attributes with many values and makes the tree less sensitive to noise. It can also prune the constructed tree and handle continuous attributes and missing values. However, because C4.5 scans the dataset multiple times, it is relatively inefficient.
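A minimal sketch of the gain ratio, normalizing the gain by the intrinsic value $\operatorname{IV}(a)$ of the attribute, as in C4.5 (helper names are illustrative):

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """GainRatio(D, a) = Gain(D, a) / IV(a).

    Assumes a takes at least two distinct values, so IV(a) > 0.
    """
    n = len(labels)
    remainder, iv = 0.0, 0.0
    for v in set(values):
        subset = [l for val, l in zip(values, labels) if val == v]
        w = len(subset) / n
        remainder += w * entropy(subset)
        iv -= w * log(w, 2)  # IV(a): entropy of the partition itself
    return (entropy(labels) - remainder) / iv

y = ['yes', 'yes', 'no', 'no', 'no']
print(round(gain_ratio([1, 1, 1, 0, 0], y), 3))  # → 0.433
```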

Gini Index

$$G(p)=\sum_{k=1}^{K} p_{k}\left(1-p_{k}\right)=1-\sum_{k=1}^{K} p_{k}^{2}$$

where $K$ is the number of classes. The Gini index is the splitting criterion used by CART.
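The Gini index can be read as the probability that two samples drawn at random belong to different classes: 0 for a pure node, larger as classes mix. A minimal sketch (name is illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(round(gini(['yes', 'yes', 'no', 'no', 'no']), 2))  # 1 - (0.4^2 + 0.6^2) → 0.48
print(gini(['yes', 'yes']))                              # → 0.0 (pure node)
```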

ID3 Decision Tree Code

```python
from math import log
from collections import Counter

import numpy as np


class DecisionTree:
    """ID3 decision tree."""

    def __init__(self):
        self.decisionTree = None
        self._X = None
        self._y = None

    # Shannon entropy of a label vector
    def calcShannonEnt(self, y):
        labelsCounter = Counter(y)
        shannonEnt = 0.0
        for num in labelsCounter.values():
            p = num / len(y)
            shannonEnt += -p * log(p, 2)
        return shannonEnt

    def fit(self, X, y):
        self._X = X
        self._y = y
        self.decisionTree = self.createTree(self._X, self._y)
        return self

    # Keep the rows where feature d equals value, then drop column d
    def splitDataset(self, X, y, d, value):
        mask = X[:, d] == value
        features = X[mask]
        labels = y[mask]
        return np.concatenate((features[:, :d], features[:, d + 1:]), axis=1), labels

    def chooseBestFeatureToSplit(self, X, y):
        numFeatures = X.shape[1]
        baseEntropy = self.calcShannonEnt(y)
        bestInfoGain, bestFeature = 0.0, 0
        for i in range(numFeatures):
            # Unique values of feature i define the candidate branches
            uniqueVals = np.unique(X[:, i])
            newEntropy = 0.0
            # Weighted entropy of each branch after the split
            for value in uniqueVals:
                _x, _y = self.splitDataset(X, y, i, value)
                prob = len(_x) / len(X)
                newEntropy += prob * self.calcShannonEnt(_y)
            infoGain = baseEntropy - newEntropy
            if infoGain > bestInfoGain:
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature

    # Most frequent label; most_common(1)[0] is a (label, count) pair,
    # so index into it again to return just the label
    def majorityCnt(self, y):
        labelsCounter = Counter(y)
        return labelsCounter.most_common(1)[0][0]

    def createTree(self, X, y):
        # Stop if all samples share one class
        if y[y == y[0]].size == y.size:
            return y[0]
        # Stop when no features remain; return the majority class
        if X.shape[1] == 0:
            return self.majorityCnt(y)
        bestFeat = self.chooseBestFeatureToSplit(X, y)
        decisionTree = {bestFeat: {}}
        for value in np.unique(X[:, bestFeat]):
            decisionTree[bestFeat][value] = self.createTree(
                *self.splitDataset(X, y, bestFeat, value))
        return decisionTree


if __name__ == '__main__':
    dataSet = np.array([[1, 1, 'yes'],
                        [1, 1, 'yes'],
                        [1, 0, 'no'],
                        [0, 1, 'no'],
                        [0, 1, 'no']])
    labels = ['no surfacing', 'flippers']  # feature names (not used by fit)
    dt = DecisionTree()
    # np.int was removed from NumPy; use the builtin int
    X = dataSet[:, :2].astype(int)
    y = dataSet[:, -1]
    dt.fit(X, y)
    print(dt.decisionTree)
```
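The listing above only builds the tree. A hypothetical `predict_one` helper (not part of the original) can classify a sample by walking the nested `{feature_index: {value: subtree}}` dict. One subtlety: `splitDataset` drops the consumed column, so each stored index refers to the *reduced* feature vector, and prediction must mirror that deletion:

```python
def predict_one(tree, x):
    """Classify one sample with a nested {feature_index: {value: subtree}} tree."""
    x = list(x)
    while isinstance(tree, dict):
        d = next(iter(tree))  # feature index at this node (into the reduced sample)
        tree = tree[d][x[d]]  # follow the branch matching the sample's value
        del x[d]              # drop the consumed feature, as splitDataset does
    return tree

# Tree built by the listing above for the toy dataset (keys written as plain ints)
tree = {0: {0: 'no', 1: {0: {0: 'no', 1: 'yes'}}}}
print(predict_one(tree, [1, 1]))  # → yes
print(predict_one(tree, [1, 0]))  # → no
```

This sketch raises `KeyError` for a feature value never seen during training; a fuller implementation would fall back to the majority class.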

References

Peter Harrington, *Machine Learning in Action*, Manning, 2012