## Affinity analysis

Given a history of orders, we can extract rules about products X and Y that are bought together: if a person buys X (the premise), how likely are they to buy Y (the conclusion)?

- Support: the number of historical orders in which premise -> conclusion occurs.
- Confidence: the support divided by the number of historical orders containing the premise.
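These two measures can be computed directly from a toy order history (the product names below are invented for illustration):

```python
# Toy order history: each order is the set of products bought together.
orders = [
    {"bread", "milk"},
    {"bread", "milk", "apples"},
    {"milk", "apples"},
    {"bread", "apples"},
    {"bread", "milk"},
]

premise, conclusion = "bread", "milk"

# Support: number of orders containing both the premise and the conclusion.
support = sum(1 for order in orders if premise in order and conclusion in order)
# Confidence: support divided by the number of orders containing the premise.
premise_count = sum(1 for order in orders if premise in order)
confidence = support / premise_count

print(support, confidence)  # 3 orders contain both; 3 of 4 bread orders -> 0.75
```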

## Implementing the OneR algorithm

OneR (One Rule) predicts the class by choosing the single best feature and using it alone.

OneR needs discrete feature values, so the book uses each feature's mean as a threshold: values above the mean become 1 and the rest become 0, leaving each feature with only two values.
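A minimal sketch of that mean-threshold discretization on a toy array:

```python
import numpy as np

# Toy continuous features: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Column means are [2.0, 20.0]; values strictly above the mean become 1.
# Rows become [0 0], [0 0], [1 1].
X_d = (X > X.mean(axis=0)).astype(int)
print(X_d)
```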


```
from collections import defaultdict
from operator import itemgetter

# X: samples; y_true: each sample's class;
# feature: index of the chosen feature; value: one value of that feature
def train_feature_value(X, y_true, feature, value):
    class_count = defaultdict(int)
    for sample, cls in zip(X, y_true):
        if sample[feature] == value:
            class_count[cls] += 1
    most_frequent_class = sorted(class_count.items(), key=itemgetter(1), reverse=True)[0][0]
    error = sum(cnt for cls, cnt in class_count.items() if cls != most_frequent_class)
    return most_frequent_class, error

def train(X, y_true, feature):
    n_samples, n_features = X.shape
    values = set(X[:, feature])
    predictors = {}
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    total_error = sum(errors)
    return predictors, total_error

def OneR(X, y_true):
    all_predictors = {}
    errors = {}
    for feature in range(X.shape[1]):
        predictor, total_error = train(X, y_true, feature)
        all_predictors[feature] = predictor
        errors[feature] = total_error
    # keep the feature with the lowest total error
    feature = sorted(errors.items(), key=itemgetter(1))[0][0]
    return {'feature': feature, 'predictor': all_predictors[feature]}
```

The data should be split into a training set and a test set to guard against overfitting.

Use `train_test_split` from `sklearn.cross_validation` (cross validation, CV for short):

```
from sklearn.cross_validation import train_test_split
Xd_train, Xd_test, y_train, y_test = train_test_split(Xd, Y)
```

```
def predict(X_test, model):
    var = model['feature']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[var])] for sample in X_test])
    return y_predicted

In [371]: np.mean(predict(Xd_test, model) == y_test)
Out[371]: 0.65789473684210531
```

## Classifying with `scikit-learn` `Estimators`

The main concepts in `scikit-learn`:

- Estimators: for classification, clustering, and regression
- Transformers: for preprocessing and selecting data
- Pipelines: for combining a workflow into a reusable unit

## Estimators

Estimators have two especially important methods:

`fit()`: runs the training algorithm and sets the internal parameters. It takes two inputs: the training set and its corresponding classes (like `OneR(X, y_true)` above).

`predict()`: takes only a test set as input and outputs the predicted classes (like `predict(X_test, model)` above).

### Distance metrics

- Euclidean distance: the square root of the sum of squared per-feature differences. (With many features, every sample can end up roughly equidistant.)
- Manhattan distance: the sum of absolute per-feature differences. (Features with large values can drown out smaller ones; normalization helps.)
- Cosine distance: the angle between feature vectors, discarding length information. (Because vector length is ignored, it suits data with many features, e.g. text mining.)
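For two concrete vectors, each metric is a line of NumPy (toy values, for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the sum of squared per-feature differences
euclidean = np.sqrt(np.sum((a - b) ** 2))
# Manhattan: sum of absolute per-feature differences
manhattan = np.sum(np.abs(a - b))
# Cosine distance: 1 minus the cosine of the angle between the vectors
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine)  # 5.0, 7.0, ~0.145
```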

### Loading the dataset

```
import csv
import numpy as np

X = np.zeros((351, 34), dtype='float')
Y = np.zeros((351,), dtype='bool')
with open('../data/ionosphere.data', 'r') as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        data = [float(dat) for dat in row[:-1]]
        X[i] = data
        Y[i] = row[-1] == 'g'
```

### Standard workflow

```
Xd = (X > X.mean(axis=0)).astype(int)
Xd_train, Xd_test, Y_train, Y_test = train_test_split(Xd, Y)
model = OneR(Xd_train, Y_train)

In [590]: (predict(Xd_test, model) == Y_test).mean()
Out[590]: 0.76136363636363635
```

76.14% — not bad.

```
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
estimator = KNeighborsClassifier()
```

#### Training:

```
In [593]: estimator.fit(X_train, Y_train)
Out[593]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
```

```
In [594]: (estimator.predict(X_test) == Y_test).mean()
Out[594]: 0.86363636363636365
```

86.36% — a clear improvement in accuracy, and this is still with default parameters. If you know your application well, tuning the parameters helps a lot; a later chapter covers parameter search.

### Cross-fold validation (CV)

If we repeatedly split the data into random training and test sets, train on each split, and then average the resulting accuracies, the result reflects the algorithm's real performance much better than a single split.

The cross-fold validation framework handles this neatly. The process:

1. Split the whole dataset into K roughly equal parts (folds).
2. For each fold:
   1. Use that fold as the test set.
   2. Use the remaining (K-1) folds as the training set.
   3. Evaluate the accuracy on the current test set.
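The steps above can be sketched by hand with `numpy.array_split` (a rough sketch, not scikit-learn's own implementation):

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Split sample indices into k roughly equal folds."""
    return np.array_split(np.arange(n_samples), k)

# Each fold serves as the test set once; the other folds form the training set.
folds = kfold_indices(10, 3)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, evaluate accuracy on test_idx ...
    print(i, len(train_idx), len(test_idx))
```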

```
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.cross_validation import cross_val_score

class OneRES(BaseEstimator, ClassifierMixin):
    def __init__(self):
        self.model = {}
    def fit(self, X, Y):
        Xd = (X >= X.mean(axis=0)).astype(int)
        self.model = OneR(Xd, Y)
        return self  # scikit-learn convention: fit() returns the estimator
    def predict(self, X):
        Xd = (X >= X.mean(axis=0)).astype(int)
        return predict(Xd, self.model)

estimator = OneRES()
scores = cross_val_score(estimator, X, Y, scoring='accuracy')

In [669]: scores
Out[669]: array([ 0.66666667,  0.65811966,  0.68376068])
In [670]: scores.mean()
Out[670]: 0.66951566951566954
```

Cross-validating the KNN classifier the same way:

```
from sklearn.cross_validation import cross_val_score

estimator = KNeighborsClassifier()
scores = cross_val_score(estimator, X, Y, scoring='accuracy')

In [623]: scores
Out[623]: array([ 0.82051282,  0.78632479,  0.86324786])
In [624]: scores.mean()
Out[624]: 0.8233618233618234
```

### Parameter tuning

The `n_neighbors` parameter can be swept over a range of values to observe its effect on the result.

```
from matplotlib import pyplot as plt

avg_score = []
para_values = list(range(1, 25 + 1))
for n_neighbors in para_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, Y, scoring='accuracy')
    avg_score.append(scores.mean())
plt.plot(para_values, avg_score, '-o')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
```

![Nearest neighbors](Nearest%20neighbors.png)

## Preprocessing

To see why feature scales matter, artificially "break" the dataset by shrinking every other feature by a factor of 10, and compare the KNN scores:

```
X_broken = X.copy()
X_broken[:, ::2] /= 10
estimator = KNeighborsClassifier()

In [732]: cross_val_score(estimator, X, Y, scoring='accuracy')
Out[732]: array([ 0.82051282,  0.78632479,  0.86324786])
In [733]: cross_val_score(estimator, X_broken, Y, scoring='accuracy')
Out[733]: array([ 0.75213675,  0.64957265,  0.74358974])
```

### Standard preprocessing

`MinMaxScaler` scales each feature into the range [0, 1]:

`from sklearn.preprocessing import MinMaxScaler`

`X_transformed = MinMaxScaler().fit_transform(X)`

### Putting it together

```
X_transformed = MinMaxScaler().fit_transform(X_broken)
estimator = KNeighborsClassifier()

In [765]: cross_val_score(estimator, X_transformed, Y, scoring='accuracy')
Out[765]: array([ 0.82905983,  0.77777778,  0.86324786])
```

## Pipelines

`from sklearn.pipeline import Pipeline`

The two steps, `MinMaxScaler` and `KNeighborsClassifier`, go into a list that is passed as the Pipeline's input:

```
scaling_pipeline = Pipeline([('scale', MinMaxScaler()),
                             ('predict', KNeighborsClassifier())])

In [776]: cross_val_score(scaling_pipeline, X, Y, scoring='accuracy')
Out[776]: array([ 0.82905983,  0.77777778,  0.86324786])
```

## Loading the dataset

### Loading the dataset with pandas

```
import pandas as pd
ds = pd.read_csv('basketball.csv')

In [194]: ds.head()
Out[194]:
              Date Start (ET)       Visitor/Neutral  PTS  \
0  Tue Oct 27 2015    8:00 pm       Detroit Pistons  106
1  Tue Oct 27 2015    8:00 pm   Cleveland Cavaliers   95
2  Tue Oct 27 2015   10:30 pm  New Orleans Pelicans   95
3  Wed Oct 28 2015    7:30 pm    Philadelphia 76ers   95
4  Wed Oct 28 2015    7:30 pm         Chicago Bulls  115
            Home/Neutral  PTS.1 Unnamed: 6 Unnamed: 7 Notes
0          Atlanta Hawks     94  Box Score        NaN   NaN
1          Chicago Bulls     97  Box Score        NaN   NaN
2  Golden State Warriors    111  Box Score        NaN   NaN
3         Boston Celtics    112  Box Score        NaN   NaN
4          Brooklyn Nets    100  Box Score        NaN   NaN
```

### Cleaning the dataset

```
ds = pd.read_csv('basketball.csv', parse_dates=["Date"])
ds.columns = ["Date", "Start (ET)", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Score Type", "Notes"]

In [200]: ds.head()
Out[200]:
        Date Start (ET)          Visitor Team  VisitorPts  \
0 2015-10-27    8:00 pm       Detroit Pistons         106
1 2015-10-27    8:00 pm   Cleveland Cavaliers          95
2 2015-10-27   10:30 pm  New Orleans Pelicans          95
3 2015-10-28    7:30 pm    Philadelphia 76ers          95
4 2015-10-28    7:30 pm         Chicago Bulls         115
               Home Team  HomePts        OT? Score Type Notes
0          Atlanta Hawks       94  Box Score        NaN   NaN
1          Chicago Bulls       97  Box Score        NaN   NaN
2  Golden State Warriors      111  Box Score        NaN   NaN
3         Boston Celtics      112  Box Score        NaN   NaN
4          Brooklyn Nets      100  Box Score        NaN   NaN
```

```
In [202]: ds.dtypes
Out[202]:
Date            datetime64[ns]
Start (ET)              object
Visitor Team            object
VisitorPts               int64
Home Team               object
HomePts                  int64
OT?                     object
Score Type              object
Notes                   object
dtype: object
```

### Extracting features

The home team wins when it scores more points than the visitor:

`ds["HomeWin"] = ds["VisitorPts"] < ds["HomePts"]`

`y_true = ds["HomeWin"].values`

```
In [355]: ds['HomeWin'].mean()
Out[355]: 0.5942249240121581
```

```
from collections import defaultdict

won_last = defaultdict(bool)
ds["VisitorLastWin"] = False
ds["HomeLastWin"] = False
for idx, row in ds.sort_values(by='Date').iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # record whether each team won its previous game
    ds.set_value(idx, "HomeLastWin", won_last[home_team])
    ds.set_value(idx, "VisitorLastWin", won_last[visitor_team])
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
```

```
In [390]: ds[1000:1005]
Out[390]:
           Date Start (ET)           Visitor Team  VisitorPts  \
1000 2016-03-15    7:00 pm         Denver Nuggets         110
1001 2016-03-15    8:30 pm   Los Angeles Clippers          87
1002 2016-03-16    7:00 pm  Oklahoma City Thunder         130
1003 2016-03-16    7:00 pm          Orlando Magic          99
1004 2016-03-16    7:00 pm       Dallas Mavericks          98
                Home Team  HomePts        OT? Score Type Notes  HomeWin  \
1000        Orlando Magic      116  Box Score        NaN   NaN     True
1001    San Antonio Spurs      108  Box Score        NaN   NaN     True
1002       Boston Celtics      109  Box Score        NaN   NaN    False
1003    Charlotte Hornets      107  Box Score        NaN   NaN     True
1004  Cleveland Cavaliers       99  Box Score        NaN   NaN     True
     HomeLastWin  VisitorLastWin
1000        False           False
1001         True           False
1002        False            True
1003        False            True
1004        False            True
```

## Decision trees

![Decision Tree](Decision%20Tree.png)

KNN is a lazy learner: it only does real work at prediction time. A decision tree is an eager learner: the work happens during training, leaving little to do at prediction time.

scikit-learn implements the **Classification and Regression Tree (CART)** algorithm, which its decision tree classes use by default; it handles both categorical and continuous features.

### Parameters

The stopping criterion decides how far tree construction goes, which helps avoid overfitting.

An alternative is pruning: build the full tree first, then remove nodes that provide little information.

- min_samples_split: how many samples are required to split a node
- min_samples_leaf: how many samples a node must cover to be kept

Two common criteria for choosing splits:

- Gini impurity
- Information gain
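These knobs correspond to `DecisionTreeClassifier` constructor arguments; the values below are illustrative, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Require at least 10 samples before a node may be split, keep only leaves
# covering at least 5 samples, and choose splits by information gain
# ('entropy') instead of the default Gini impurity.
clf = DecisionTreeClassifier(min_samples_split=10,
                             min_samples_leaf=5,
                             criterion='entropy')
```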

### Using decision trees

```
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
```

Like other estimators, it provides `fit` and `predict`. Using the two "last win" features:

```
X_previouswins = ds[["HomeLastWin", "VisitorLastWin"]].values

In [409]: cross_val_score(clf, X_previouswins, y_true, scoring='accuracy').mean()
Out[409]: 0.59422445505386179
```

Feature selection matters a great deal in data mining: choosing good features can drive good results, sometimes even more than the choice of algorithm.

## Predicting sports outcomes

### Putting it together

The next feature uses each team's league standing; the data is already downloaded: standings.csv

```
standings = pd.read_csv('standings.csv', skiprows=1)

In [411]: standings.head()
Out[411]:
   Rk                   Team Overall   Home   Road      E      W     A     C  \
0   1  Golden State Warriors   67-15   39-2  28-13   25-5  42-10   9-1   7-3
1   2          Atlanta Hawks   60-22   35-6  25-16  38-14   22-8  12-6  14-4
2   3        Houston Rockets   56-26  30-11  26-15   23-7  33-19   9-1   8-2
3   4   Los Angeles Clippers   56-26  30-11  26-15  19-11  37-15   7-3   6-4
4   5      Memphis Grizzlies   55-27  31-10  24-17  20-10  35-17   8-2   5-5
     SE ...    Post   ≤3    ≥10  Oct   Nov   Dec   Jan  Feb   Mar  Apr
0   9-1 ...    25-6  5-3   45-9  1-0  13-2  11-3  12-3  8-3  16-2  6-2
1  12-4 ...   17-11  6-4  30-10  0-1   9-5  14-2  17-0  7-4   9-7  4-3
2   6-4 ...    20-9  8-4  31-14  2-0  11-4   9-5  11-6  7-3  10-6  6-2
3   6-4 ...    21-7  3-5   33-9  2-0   9-5  11-6  11-4  5-6  11-5  7-0
4   7-3 ...   16-13  9-3  26-13  2-0  13-2   8-6  12-4  7-4   9-8  4-3
[5 rows x 24 columns]
```

```
ds["HomeTeamRanksHigher"] = False
for idx, row in ds.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
    ds.set_value(idx, "HomeTeamRanksHigher", home_rank < visitor_rank)
```

```
X_homehigher = ds[["HomeTeamRanksHigher", "HomeLastWin", "VisitorLastWin"]].values
clf = DecisionTreeClassifier()

In [460]: cross_val_score(clf, X_homehigher, y_true, scoring='accuracy').mean()
Out[460]: 0.60865985028933201
```

```
last_match_winner = defaultdict(str)
ds["HomeTeamWonLast"] = False
for idx, row in ds.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # sort the pair so the key is the same regardless of who is at home
    team = tuple(sorted([home_team, visitor_team]))
    home_team_won_last = last_match_winner[team] == home_team
    ds.set_value(idx, "HomeTeamWonLast", home_team_won_last)
    winner = home_team if row["HomeWin"] else visitor_team
    last_match_winner[team] = winner
```

```
X_lastwinner = ds[["HomeTeamWonLast", "HomeTeamRanksHigher", "HomeLastWin", "VisitorLastWin"]].values
clf = DecisionTreeClassifier()

In [479]: cross_val_score(clf, X_lastwinner, y_true, scoring='accuracy').mean()
Out[479]: 0.62158877759401987
```

Use the `LabelEncoder` transformer to convert team names into numeric data, assigning each team a number in turn.

```
from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
# assign a number to every team name
encoding.fit(ds["Home Team"].values)
home_teams = encoding.transform(ds["Home Team"].values)
visitor_teams = encoding.transform(ds["Visitor Team"].values)
# each row of X_teams holds the two teams' numbers
X_teams = np.vstack((home_teams, visitor_teams)).T
```

Then use the `OneHotEncoder` transformer to turn each number into a binary vector whose length equals the number of teams. Concretely, with 30 teams numbered 0 through 29, the vector has length 30: the position corresponding to the team is 1 and every other position is 0. Team 7, for example, gets a 1 in position 7 and 0 everywhere else.

```
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
X_teams = onehot.fit_transform(X_teams).todense()
```

```
clf = DecisionTreeClassifier()

In [520]: cross_val_score(clf, X_teams, y_true, scoring='accuracy').mean()
Out[520]: 0.62538355124244605
```