WeChat official account: 尤而小屋
Author: Peter
Editor: Peter
Hi everyone, I'm Peter~
Today I'm sharing a modeling case study based on a Kaggle dataset about stroke. The main contents of the article are outlined in the accompanying mind map.
Original dataset: www.kaggle.com/datasets/fe…
Importing Libraries
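The import cell is not reproduced in the article; based on the code used throughout, it presumably looked something like this (the aliases are the conventional ones, and the imbalanced-learn package is assumed for SMOTE):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from imblearn.over_sampling import SMOTE  # oversampling, from the imbalanced-learn package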
Basic Data Information
First, load the data and take a look at its basic information.
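The loading cell is not shown either; assuming the CSV keeps the file name used on the Kaggle page (an assumption), it can be read like this:

df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # file name assumed
df.head()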
In [3]:
df.shape
Out[3]:
(5110, 12)
In [4]:
df.dtypes
Out[4]:
id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object   # string type
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object
In [5]:
df.describe()  # descriptive statistics
Out[5]:
Field Distributions
gender counts
In [6]:
plt.figure(1, figsize=(12,5)) sns.countplot(y="gender", data=df) plt.show()
age distribution
In [7]:
px.violin(y=df["age"])
fig = px.histogram(df, x="age", color_discrete_sequence=['firebrick']) fig.show()
ever_married
In [9]:
plt.figure(1, figsize=(12,5)) sns.countplot(y="ever_married", data=df) plt.show()
In this dataset, married people outnumber unmarried people roughly two to one.
work_type
Count the number of people in each work type:
In [10]:
plt.figure(1, figsize=(12,8)) sns.countplot(y="work_type", data=df) plt.show()
Residence_type
In [11]:
plt.figure(1, figsize=(12,8)) sns.countplot(y="Residence_type", data=df) plt.show()
avg_glucose_level
Distribution of the average glucose level:
fig = px.histogram(df, x="avg_glucose_level", color_discrete_sequence=['firebrick']) fig.show()
We can see that most people's average glucose level is below 100, which is within the normal range.
bmi
Distribution of the bmi field:
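The plotting cell is missing here; by analogy with the glucose plot above, it was presumably something like:

fig = px.histogram(df, x="bmi", color_discrete_sequence=['firebrick'])
fig.show()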
The mean of bmi is around 28, and the distribution is roughly normal.
smoking_status
Counts of smoking status:
plt.figure(1, figsize=(12,8)) sns.countplot(y="smoking_status", data=df) plt.show()
We can see that relatively fewer people currently smoke or formerly smoked.
Missing Values
Count of missing values:
df.isnull().sum()
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64
201 / len(df)  # proportion of missing values
0.03933463796477495
Visualizing missing values
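The visualization cell is not shown. One simple option with the libraries already imported (an assumption, not necessarily the original approach) is a heatmap of the null mask:

# light cells mark missing entries; only the bmi column has any
sns.heatmap(df.isnull(), cbar=False)
plt.show()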
Handling Missing Values
We use a decision tree regressor to predict the missing bmi values: train on age and gender against the existing bmi values, then fill the predictions back in.
dt_bmi = Pipeline(steps=[("scale", StandardScaler()),  # standardize the data
                         ("lr", DecisionTreeRegressor(random_state=42))])
Pull out the 3 relevant columns for the prediction-and-fill step:
X = df[["age","gender","bmi"]].copy() dic = {"Male":0, "Female":1, "Other":-1} X["gender"] = X["gender"].map(dic).astype(np.uint8) X.head()
Split out the non-missing part for training:
# rows with missing bmi
missing = X[X.bmi.isna()]
# rows with bmi present
X = X[~X.bmi.isna()]
y = X.pop("bmi")
# train the model
dt_bmi.fit(X, y)
Pipeline(steps=[('scale', StandardScaler()), ('lr', DecisionTreeRegressor(random_state=42))])
In [23]:
# predict the missing values
y_pred = dt_bmi.predict(missing[["age","gender"]])
y_pred[:5]
Out[23]:
array([29.87948718, 30.55609756, 27.24722222, 30.84186047, 33.14666667])
Convert the predictions into a Series, keeping the original index:
predict_bmi = pd.Series(y_pred, index=missing.index)
predict_bmi
1       29.879487
8       30.556098
13      27.247222
19      30.841860
27      33.146667
          ...
5039    32.716000
5048    28.313636
5093    31.459322
5099    28.313636
5105    28.476923
Length: 201, dtype: float64
Fill the predictions back into the original df:
df.loc[missing.index, "bmi"] = predict_bmi
After the prediction and filling above, checking again shows that no missing values remain:
df.isnull().sum()
id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
Exploratory Data Analysis (EDA)
# all fields except the id and the stroke target
variables = [variable for variable in df.columns
             if variable not in ['id','stroke']]
variables
['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status']
Continuous Variables
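The plotting cell is not shown; a minimal sketch that draws a density plot for each of the three continuous variables (an assumed reconstruction):

continuous = ["age", "avg_glucose_level", "bmi"]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, continuous):
    sns.kdeplot(df[col], fill=True, ax=ax)  # density of each variable
    ax.set_title(col)
plt.show()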
A few conclusions:

- age: the overall distribution is fairly balanced, with small differences across age groups
- avg_glucose_level: mostly concentrated below 100
- bmi: roughly normally distributed
Stroke vs. No Stroke
Above we looked at the distributions of the continuous variables; bmi shows a noticeably skewed distribution. Next, we compare the stroke and non-stroke groups:
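The comparison cell is not shown; a sketch that overlays the two groups' densities for each continuous variable (assumed, not the original code):

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["age", "avg_glucose_level", "bmi"]):
    sns.kdeplot(df.loc[df["stroke"] == 0, col], fill=True, label="no stroke", ax=ax)
    sns.kdeplot(df.loc[df["stroke"] == 1, col], fill=True, label="stroke", ax=ax)
    ax.set_title(col)
    ax.legend()
plt.show()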
From the three density plots we can see that age is the single most important factor for whether someone has a stroke.
Compare glucose levels and BMI across different ages. First, age vs. average glucose level:
px.scatter(df,x="age", y="avg_glucose_level", color="stroke", trendline='ols' )
And age vs. bmi:
px.scatter(df,x="age", y="bmi", color="stroke", trendline='ols' )
Age and the probability of stroke
From the scatter plots we can see that age may indeed be a fairly important factor, showing some relationship with both BMI and average glucose level.
Perhaps the risk increases as age grows. Is that really the case?
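The plotting cell is missing here; a sketch that bins age and plots the observed stroke rate per bin (an assumed reconstruction, not necessarily the original figure):

# bin age into 10 intervals and compute the observed stroke rate in each
age_bins = pd.cut(df["age"], bins=10)
rate = df.groupby(age_bins)["stroke"].mean()
rate.plot(kind="bar", figsize=(12, 5))
plt.ylabel("stroke rate")
plt.show()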
The figure above shows two things:

- The older the age, the higher the probability of stroke
- The overall probability of stroke is very low (the y-axis values are small), which is caused by the imbalance between stroke and non-stroke samples
Among the roughly 5,000 samples in the original data, only 249 are stroke cases, a ratio of close to 1:20.
Sample Imbalance
Overall Attribute Distributions
First, we remove the records where gender is Other:
In [34]:
str_only = df[df['stroke'] == 1]     # stroke
no_str_only = df[df['stroke'] == 0]  # no stroke
In [35]:
len(str_only)
Out[35]:
249
In [36]:
# drop Other
no_str_only = no_str_only[(no_str_only['gender'] != 'Other')]
Compare the stroke and non-stroke groups across the different attributes:
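The comparison cells are not shown; a sketch that draws each categorical attribute side by side for the two groups (the column list is an assumption):

cat_cols = ["gender", "hypertension", "heart_disease", "ever_married",
            "work_type", "Residence_type", "smoking_status"]
fig, axes = plt.subplots(len(cat_cols), 2, figsize=(12, 4 * len(cat_cols)))
for i, col in enumerate(cat_cols):
    sns.countplot(y=col, data=str_only, ax=axes[i, 0])
    axes[i, 0].set_title(f"{col} (stroke)")
    sns.countplot(y=col, data=no_str_only, ax=axes[i, 1])
    axes[i, 1].set_title(f"{col} (no stroke)")
plt.tight_layout()
plt.show()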
Modeling
Model Baseline
In [38]:
len(str_only)
Out[38]:
249
In [39]:
249 / len(df)
Out[39]:
0.0487279843444227
This tells us that 249 people in the dataset had a stroke, out of len(df) people in total; the expression above gives the baseline for this modeling task.
In other words, about 4.8% of all samples are positive stroke cases, so 4.8% is the chance-level recall on stroke patients that any useful model should beat.
Field Encoding
Encode the 4 string-type fields:
In [40]:
df['gender'] = df['gender'].replace({'Male':0, 'Female':1, 'Other':-1}).astype(np.uint8)
df['Residence_type'] = df['Residence_type'].map({'Rural':0, 'Urban':1}).astype(np.uint8)
df['work_type'] = df['work_type'].map({'Private':0,
                                       'Self-employed':1,
                                       'Govt_job':2,
                                       'children':-1,
                                       'Never_worked':-2}).astype(np.uint8)
df['ever_married'] = df['ever_married'].map({'No':0, 'Yes':1}).astype(np.uint8)
df.head()
One-hot encode the smoking status:
In [41]:
df["smoking_status"].value_counts()
Out[41]:
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64
In [42]:
df = df.join(pd.get_dummies(df["smoking_status"]))
df.drop("smoking_status", axis=1, inplace=True)
Data Splitting
In [43]:
# features
X = df.drop("stroke", axis=1)
# target variable
y = df['stroke']

from sklearn.model_selection import train_test_split

# 30/70 split: train_size=0.3 puts 30% of the data in the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)
Oversampling
As mentioned earlier, the ratio of stroke to non-stroke samples in this case is close to 1:20, so here we adopt SMOTE-based oversampling.
In [44]:
oversample = SMOTE()
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train.ravel())
In [45]:
len(y_train_smote)
Out[45]:
In [46]:
len(X_train_smote)
Out[46]:
Building the Models
We build three different classification models: Random Forest, SVM, and Logistic Regression.
In [47]:
rf_pipeline = Pipeline(steps=[('scale', StandardScaler()),                       # standardization
                              ('RF', RandomForestClassifier(random_state=42))])  # model
svm_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                               ('SVM', SVC(random_state=42))])
logreg_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                                  ('LR', LogisticRegression(random_state=42))])
10-Fold Cross-Validation
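The cross-validation cell itself is missing from the article; a sketch of how rf_cv, svm_cv, and logreg_cv could have been computed (cv=10 matches the heading; the f1 scoring choice is an assumption):

from sklearn.model_selection import cross_val_score

rf_cv = cross_val_score(rf_pipeline, X_train_smote, y_train_smote, cv=10, scoring='f1')
svm_cv = cross_val_score(svm_pipeline, X_train_smote, y_train_smote, cv=10, scoring='f1')
logreg_cv = cross_val_score(logreg_pipeline, X_train_smote, y_train_smote, cv=10, scoring='f1')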
Comparing the three models' scores
In [49]:
print('Random Forest:', rf_cv.mean())
print('SVM:', svm_cv.mean())
print('Logistic Regression:', logreg_cv.mean())

Random Forest: 0.9628909366701726
SVM: 0.9363667907023254
Logistic Regression: 0.8859930523017683
Clearly, the random forest performs best!
Model Training (fit)
In [50]:
rf_pipeline.fit(X_train_smote, y_train_smote)
svm_pipeline.fit(X_train_smote, y_train_smote)
logreg_pipeline.fit(X_train_smote, y_train_smote)
Out[50]:
Pipeline(steps=[('scale', StandardScaler()), ('LR', LogisticRegression(random_state=42))])
In [51]:
# predictions from the 3 models
rf_pred = rf_pipeline.predict(X_test)
svm_pred = svm_pipeline.predict(X_test)
logreg_pred = logreg_pipeline.predict(X_test)
Evaluation Metrics
In [52]:
# 1. confusion matrices
rf_cm = confusion_matrix(y_test, rf_pred)
svm_cm = confusion_matrix(y_test, svm_pred)
logreg_cm = confusion_matrix(y_test, logreg_pred)
In [53]:
print(rf_cm) print("----") print(svm_cm) print("----") print(logreg_cm) [[3338 66] [ 164 9]] ---- [[3196 208] [ 148 25]] ---- [[3138 266] [ 116 57]]
print('RF mean :', rf_f1)
print('SVM mean :', svm_f1)
print('LR mean :', logreg_f1)

RF mean : 0.07258064516129033
SVM mean : 0.1231527093596059
LR mean : 0.22983870967741934
Classification report for the random forest model:
from sklearn.metrics import classification_report

print(classification_report(y_test, rf_pred))
print('Accuracy Score: ', accuracy_score(y_test, rf_pred))

              precision    recall  f1-score   support

           0       0.95      0.98      0.97      3404
           1       0.12      0.05      0.07       173

    accuracy                           0.94      3577
   macro avg       0.54      0.52      0.52      3577
weighted avg       0.91      0.94      0.92      3577

Accuracy Score:  0.9357003075202683
Tuning the Random Forest Model
Parameter tuning via grid search:
from sklearn.model_selection import GridSearchCV

n_estimators = [64, 100, 128, 200]
max_features = [2, 3, 5, 7]
bootstrap = [True, False]

param_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'bootstrap': bootstrap}

rfc = RandomForestClassifier()
grid = GridSearchCV(rfc, param_grid)
grid.fit(X_train, y_train)
GridSearchCV(estimator=RandomForestClassifier(), param_grid={'bootstrap': [True, False], 'max_features': [2, 3, 5, 7], 'n_estimators': [64, 100, 128, 200]})
grid.best_params_  # the best parameters found
{'bootstrap': False, 'max_features': 3, 'n_estimators': 200}
# rebuild the random forest with the tuned parameters
rfc = RandomForestClassifier(max_features=3,
                             n_estimators=200,
                             bootstrap=False)
rfc.fit(X_train_smote, y_train_smote)
rfc_tuned_pred = rfc.predict(X_test)
# classification report for the tuned model
print(classification_report(y_test, rfc_tuned_pred))
print('Accuracy Score: ', accuracy_score(y_test, rfc_tuned_pred))
print('F1 Score: ', f1_score(y_test, rfc_tuned_pred))

              precision    recall  f1-score   support

           0       0.95      0.98      0.97      3404
           1       0.05      0.02      0.03       173

    accuracy                           0.94      3577
   macro avg       0.50      0.50      0.50      3577
weighted avg       0.91      0.94      0.92      3577

Accuracy Score:  0.9362594352809617
F1 Score:  0.025641025641025644
Tuning the Logistic Regression Model
penalty = ['l1', 'l2']
C = [0.001, 0.01, 0.1, 1, 10, 100]

log_param_grid = {'penalty': penalty, 'C': C}

logreg = LogisticRegression()
grid = GridSearchCV(logreg, log_param_grid)
grid.fit(X_train_smote,y_train_smote)
GridSearchCV(estimator=LogisticRegression(), param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']})
grid.best_params_
{'C': 1, 'penalty': 'l2'}
logreg_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                                  ('LR', LogisticRegression(C=1, penalty='l2', random_state=42))])
logreg_pipeline.fit(X_train_smote, y_train_smote)
Out[65]:
Pipeline(steps=[('scale', StandardScaler()), ('LR', LogisticRegression(C=1, random_state=42))])
In [66]:
logreg_new_pred = logreg_pipeline.predict(X_test)  # new predictions
In [67]:
print(classification_report(y_test, logreg_new_pred))
print('Accuracy Score: ', accuracy_score(y_test, logreg_new_pred))
print('F1 Score: ', f1_score(y_test, logreg_new_pred))

              precision    recall  f1-score   support

           0       0.96      0.92      0.94      3404
           1       0.18      0.33      0.23       173

    accuracy                           0.89      3577
   macro avg       0.57      0.63      0.59      3577
weighted avg       0.93      0.89      0.91      3577

Accuracy Score:  0.8932065977075762
F1 Score:  0.22983870967741934
Tuning the SVM
In [68]:
svm_param_grid = {'C': [0.1, 1, 10, 100, 1000],
                  'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                  'kernel': ['rbf']}

svm = SVC(random_state=42)
grid = GridSearchCV(svm, svm_param_grid)
In [69]:
grid.fit(X_train_smote,y_train_smote)
Out[69]:
GridSearchCV(estimator=SVC(random_state=42), param_grid={'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']})
In [70]:
grid.best_params_
Out[70]:
{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
In [71]:
svm_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                               ('SVM', SVC(C=100, gamma=0.0001, kernel='rbf', random_state=42))])
svm_pipeline.fit(X_train_smote, y_train_smote)
svm_tuned_pred = svm_pipeline.predict(X_test)
In [72]:
print(classification_report(y_test, svm_tuned_pred))
print('Accuracy Score: ', accuracy_score(y_test, svm_tuned_pred))
print('F1 Score: ', f1_score(y_test, svm_tuned_pred))

              precision    recall  f1-score   support

           0       0.96      0.93      0.94      3404
           1       0.16      0.27      0.20       173

    accuracy                           0.90      3577
   macro avg       0.56      0.60      0.57      3577
weighted avg       0.92      0.90      0.91      3577

Accuracy Score:  0.8951635448700028
F1 Score:  0.19700214132762314
Conclusions
- During cross-validation, the random forest performed best.
- Comparing the three models: the random forest has the best accuracy, but its F1-score is the lowest.
- A likely characteristic of the models: they lean toward predicting who will have a stroke rather than who will not.