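:bulb: Author: 韩信子 @ ShowMeAI
:blue_book: Data Analysis in Practice series: www.showmeai.tech/tutorials/4…
:blue_book: Machine Learning in Practice series: www.showmeai.tech/tutorials/4…
:blue_book: This article: www.showmeai.tech/article-det…
:loudspeaker: Notice: All rights reserved. To republish, contact the platform and the author and credit the source.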

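:trophy: Dataset download (Baidu Netdisk): reply 『实战』 to the WeChat official account 『ShowMeAI研究中心』, or click here, to get the 『Airbnb民宿数据』 (Airbnb listings data) used in this article, [22] An Airbnb Listing Price Prediction Model.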

:star: ShowMeAI official GitHub: github.com/ShowMeAI-Hu…

## :bulb: Loading the Data and a First Look

```python
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectFromModel
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.inspection import permutation_importance

# Show every column and row when displaying DataFrames in the notebook
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
```

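```python
gm_listings = pd.read_csv('gm_listings-2.csv')
gm_listings.head()
```

```python
gm_listings.shape
# (3584, 74)
gm_listings.columns
```

```python
# gm_calendar and gm_reviews are assumed to be loaded the same way;
# their file names are not given in this article
gm_calendar.head()
gm_reviews.head()
```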

## :pushpin: Column Cleanup

```python
# Drop URL columns and other fields that are not used later
def drop_function(df):
    df = df.drop(columns=['listing_url', 'description', 'host_thumbnail_url', 'host_picture_url',
                          'latitude', 'longitude', 'picture_url', 'host_url', 'host_location',
                          'neighbourhood', 'neighbourhood_cleansed', 'host_about', 'has_availability',
                          'availability_30', 'availability_60', 'availability_90', 'availability_365',
                          'calendar_last_scraped'])
    return df

gm_df = drop_function(gm_listings)
```

## :pushpin: Handling Missing Values

```python
# Percentage of missing values per column
(gm_df.isnull().sum() / gm_df.shape[0]) * 100
```

```
id                                                0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
name                                              0.000000
neighborhood_overview                            41.266741
host_id                                           0.000000
host_name                                         0.000000
host_since                                        0.000000
host_response_time                               10.212054
host_response_rate                               10.212054
host_acceptance_rate                              5.636161
host_is_superhost                                 0.000000
host_neighbourhood                               91.657366
host_listings_count                               0.000000
host_total_listings_count                         0.000000
host_verifications                                0.000000
host_has_profile_pic                              0.000000
host_identity_verified                            0.000000
neighbourhood_group_cleansed                      0.000000
property_type                                     0.000000
room_type                                         0.000000
accommodates                                      0.000000
bathrooms                                       100.000000
bathrooms_text                                    0.306920
bedrooms                                          4.687500
beds                                              2.120536
amenities                                         0.000000
price                                             0.000000
minimum_nights                                    0.000000
maximum_nights                                    0.000000
minimum_minimum_nights                            0.000000
maximum_minimum_nights                            0.000000
minimum_maximum_nights                            0.000000
maximum_maximum_nights                            0.000000
minimum_nights_avg_ntm                            0.000000
maximum_nights_avg_ntm                            0.000000
calendar_updated                                100.000000
number_of_reviews                                 0.000000
number_of_reviews_ltm                             0.000000
number_of_reviews_l30d                            0.000000
first_review                                     19.810268
last_review                                      19.810268
review_scores_rating                             19.810268
review_scores_accuracy                           20.089286
review_scores_cleanliness                        20.089286
review_scores_checkin                            20.089286
review_scores_communication                      20.089286
review_scores_location                           20.089286
review_scores_value                              20.089286
instant_bookable                                  0.000000
calculated_host_listings_count                    0.000000
calculated_host_listings_count_entire_homes       0.000000
calculated_host_listings_count_private_rooms      0.000000
calculated_host_listings_count_shared_rooms       0.000000
reviews_per_month                                19.810268
dtype: float64
```
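As a minimal sketch of how such percentages can drive the next step, the snippet below computes the missing share per column on a small made-up frame and lists the columns above a cutoff. The toy data and the 40% threshold are assumptions for illustration; the article itself chooses the columns to drop by hand.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for gm_df
toy = pd.DataFrame({
    "price": [50, 80, 120, 60, 95],
    "bathrooms": [np.nan] * 5,                  # entirely missing
    "bedrooms": [1, np.nan, 2, 1, 3],           # 1 of 5 missing
    "overview": ["a", None, None, "b", None],   # 3 of 5 missing
})

# Same formula as above: share of nulls per column, in percent
missing_pct = toy.isnull().sum() / toy.shape[0] * 100

# Hypothetical cutoff: flag columns with more than 40% missing values
threshold = 40
to_drop = missing_pct[missing_pct > threshold].index.tolist()
toy_clean = toy.drop(columns=to_drop)

print(missing_pct)
print(to_drop)
```

Columns below the cutoff stay in the frame and are candidates for imputation instead.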

```python
# Drop columns with a high share of missing values
def drop_function_2(df):
    df = df.drop(columns=['license', 'calendar_updated', 'bathrooms',
                          'host_neighbourhood', 'neighborhood_overview'])
    return df

gm_df = drop_function_2(gm_df)

# Fill numeric columns with the column mean
def input_mean(df, column_list):
    for column in column_list:
        # Assign back rather than calling fillna(inplace=True) on a column selection
        df[column] = df[column].fillna(df[column].mean())
    return df

column_list = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
               'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
               'review_scores_value', 'reviews_per_month',
               'bedrooms', 'beds']
gm_df = input_mean(gm_df, column_list)

# Fill categorical and date-like columns with the mode
def input_mode(df, column_list):
    for column in column_list:
        df[column] = df[column].fillna(df[column].mode()[0])
    return df

column_list = ['first_review', 'last_review', 'bathrooms_text', 'host_acceptance_rate',
               'host_response_rate', 'host_response_time']
gm_df = input_mode(gm_df, column_list)
```
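To make the two fill strategies concrete, here is a small sketch on made-up rows: a numeric column is filled with its mean and a categorical column with its most frequent value. The values are hypothetical; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "review_scores_rating": [4.0, np.nan, 5.0, 4.5],
    "host_response_time": ["within an hour", None, "within an hour", "within a day"],
})

# Numeric column: fill the gap with the mean of the observed values
mean_rating = toy["review_scores_rating"].mean()
toy["review_scores_rating"] = toy["review_scores_rating"].fillna(mean_rating)

# Categorical column: fill the gap with the most frequent value
top_value = toy["host_response_time"].mode()[0]
toy["host_response_time"] = toy["host_response_time"].fillna(top_value)

print(toy)
```

Mean filling keeps the column average unchanged but shrinks its variance, which is worth remembering when roughly a fifth of the review scores are imputed.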

## :pushpin: Encoding Fields

Columns such as `host_is_superhost` and `instant_bookable` store the strings `t` and `f` for true and false; we replace them with 1 and 0.

```python
# has_availability was already dropped during column cleanup, so it is not listed here
flag_columns = ['host_is_superhost', 'host_has_profile_pic',
                'host_identity_verified', 'instant_bookable']
for col in flag_columns:
    gm_df[col] = gm_df[col].map({'t': 1, 'f': 0})
```

`gm_df['host_is_superhost'].value_counts()`
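A minimal sketch of this flag encoding on made-up rows, using `Series.map` so that only the listed columns are touched. The toy values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", "f", "t"],
    "instant_bookable": ["f", "f", "t", "t"],
    "name": ["flat", "loft", "studio", "room"],   # not a flag column, left untouched
})

flag_columns = ["host_is_superhost", "instant_bookable"]
for col in flag_columns:
    # 't' becomes 1, 'f' becomes 0; any other value would become NaN
    toy[col] = toy[col].map({"t": 1, "f": 0})

counts = toy["host_is_superhost"].value_counts()
print(counts)
```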

## :pushpin: Converting Field Formats

```python
def string_to_int(df, column):
    # Strip the currency symbol and thousands separator as literal characters
    df[column] = df[column].str.replace("$", "", regex=False)
    df[column] = df[column].str.replace(",", "", regex=False)

    # Convert to a numeric type
    df[column] = pd.to_numeric(df[column]).astype(int)

    return df

gm_df = string_to_int(gm_df, 'price')
```
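The same conversion can be checked on a few sample strings. The raw format shown here (`$1,250.00`) is an assumption about how the price column looks before cleaning.

```python
import pandas as pd

raw_price = pd.Series(["$79.00", "$1,250.00", "$8.00"])

cleaned = (
    raw_price
    .str.replace("$", "", regex=False)   # literal dollar sign, not a regex anchor
    .str.replace(",", "", regex=False)   # thousands separator
)

# to_numeric yields floats (79.0, 1250.0, 8.0); astype(int) drops the decimals
price = pd.to_numeric(cleaned).astype(int)
print(price.tolist())
```

Passing `regex=False` matters because `$` is a special character in regular expressions, and the default of the `regex` argument has changed between pandas versions.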

## :pushpin: Encoding List-Valued Fields

Fields such as `host_verifications` and `amenities` hold list-formatted values; we encode them as dummy variables.

```python
# Inspect a list-valued field on a working copy
gm_df_copy = gm_df.copy()
gm_df_copy['host_verifications'].head()
```

```python
# Dummy-variable encoding: strip quotes and brackets as literal characters, then split
for ch in ['"', ']', '[']:
    gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace(ch, '', regex=False)
df_amenities = gm_df_copy['amenities'].str.get_dummies(sep=",")

for ch in ["'", ']', '[']:
    gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace(ch, '', regex=False)
df_host_ver = gm_df_copy['host_verifications'].str.get_dummies(sep=",")
```

```python
df_amenities.head()
```

```python
# Drop the original columns
gm_df = gm_df.drop(['host_verifications', 'amenities'], axis=1)
```
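The sketch below runs the same idea on three made-up amenity strings. It differs from the code above in one detail that is an assumption of this example: it splits on `", "` (comma plus space) so that labels carry no leading space.

```python
import pandas as pd

amenities = pd.Series([
    '["Wifi", "Kitchen", "Gym"]',
    '["Wifi", "Free street parking"]',
    '["Kitchen"]',
])

# Remove brackets and double quotes in one pass, leaving 'Wifi, Kitchen, Gym'
flat = amenities.str.replace(r'[\[\]"]', "", regex=True)

# One 0/1 column per distinct amenity
dummies = flat.str.get_dummies(sep=", ")
print(dummies)
```

With a bare `","` separator, `Kitchen` and ` Kitchen` would end up as two different columns depending on the item's position in the list.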

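## :bulb: Exploratory Data Analysis

For the libraries used in this EDA section, see the cheat sheets and tutorials prepared by ShowMeAI.

## :pushpin: Which Neighbourhoods Have the Most Listings?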

`gm_df['neighbourhood_group_cleansed'].value_counts()`

```python
counts = gm_df['neighbourhood_group_cleansed'].value_counts().sort_values()
# Build a new DataFrame from the counts (explicit names work across pandas versions)
bar_data = counts.rename_axis('Towns').reset_index(name='number_of_listings')
bar_data['fraction_of_total'] = (bar_data['number_of_listings']
                                 / gm_df['neighbourhood_group_cleansed'].count())
# Rows are already in ascending order, so barh draws the largest town at the top
bar_data.plot(kind='barh', x='Towns', y='fraction_of_total', figsize=(8, 6))
plt.title('Towns with the Most Listings');
plt.xlabel('Fraction of Total Listings');
```

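## :pushpin: Price Distribution of Airbnb Listings in Greater Manchester

```python
gm_df['price'].mean(), gm_df['price'].min(), gm_df['price'].max(), gm_df['price'].median()
# (143.47600446428572, 8, 7372, 79.0)
```

The mean listing price is about $143 and the median is $79. Prices in the dataset range from $8 to $7,372, so the distribution is strongly right-skewed.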

```python
# Define the price bands (the \$ escape keeps matplotlib from reading $ as math mode)
labels = [r'\$0 - \$100', r'\$100 - \$200', r'\$200 - \$300', r'\$300 - \$400',
          r'\$400 - \$500', r'\$500 - \$1000', r'\$1000 - \$8000']
price_cuts = pd.cut(gm_df['price'], bins=[0, 100, 200, 300, 400, 500, 1000, 8000],
                    right=True, labels=labels)
# Build a DataFrame from the bands
price_clusters = pd.DataFrame(price_cuts).rename(columns={'price': 'price_clusters'})
# Append it to the original DataFrame
gm_df = pd.concat([gm_df, price_clusters], axis=1)

# Plot the distribution
def price_cluster_plot(df, column, title):
    plt.figure(figsize=(8, 6))
    yx = sb.histplot(data=df[column])

    total = float(df[column].count())
    for p in yx.patches:
        height = p.get_height()
        # Label each bar with its share of all listings
        yx.text(p.get_x() + p.get_width() / 2., height + 5,
                '{:1.1f}%'.format((height / total) * 100), ha='center')
    yx.set_title(title)
    plt.xticks(rotation=90)

    return yx

price_cluster_plot(gm_df, column='price_clusters',
                   title='Price distribution of Airbnb Listings in the Greater Manchester Area');
```
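To see how `pd.cut` assigns values at the bin edges, here is a small sketch with a handful of prices. The sample prices and the plain-text labels are assumptions for illustration; the bin edges are the ones used above.

```python
import pandas as pd

prices = pd.Series([8, 79, 100, 101, 450, 7372])
bins = [0, 100, 200, 300, 400, 500, 1000, 8000]
labels = ["0-100", "100-200", "200-300", "300-400", "400-500", "500-1000", "1000-8000"]

# right=True closes every bin on the right: (0, 100], (100, 200], ...
bands = pd.cut(prices, bins=bins, right=True, labels=labels)

# Share of listings per band, in percent, kept in band order
share = bands.value_counts(normalize=True, sort=False) * 100
print(bands.tolist())
print(share)
```

A price of exactly 100 lands in the first band and 101 in the second, which is the behaviour `right=True` asks for.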

## :pushpin: Which Property Types Are Most Popular?

```python
# Aggregate by property type and sort by review count
ax = gm_df.groupby('property_type').agg(
    median_rating=('review_scores_rating', 'median'),
    number_of_reviews=('number_of_reviews', 'max')
).sort_values(by='number_of_reviews', ascending=False).reset_index()
```

```python
# Visualise the top property types
bx = ax.loc[:10]   # .loc is label-inclusive, so this keeps the first 11 rows
bx = sb.boxplot(data=bx, x='median_rating', y='property_type')
bx.set_xlim(4.5, 5)
plt.title('Most Enjoyed Property types');
plt.xlabel('Median Rating');
plt.ylabel('Property Type')
```

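## :pushpin: Hosts and Their Share of Listings

```python
# Hosts with the most listings, as a percentage of all listings
host_share = gm_df['host_name'].value_counts() / gm_df['host_name'].count() * 100
# Explicit column names work across pandas versions
host_df = host_share.rename_axis('name').reset_index(name='perc_count')
```

```python
# Combined share of the top hosts (.loc[:10] is label-inclusive: 11 rows)
host_df['perc_count'].loc[:10].sum()
```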

## :pushpin: Room Types Offered in Greater Manchester

`gm_df['room_type'].value_counts()`

```python
# Plot the distribution
zx = sb.countplot(data=gm_df, x='room_type')
total = float(gm_df['room_type'].count())
for p in zx.patches:
    height = p.get_height()
    # Label each bar with its share of all listings
    zx.text(p.get_x() + p.get_width() / 2., height + 5,
            '{:1.1f}%'.format((height / total) * 100), ha='center')
zx.set_title('Plot showing different type of rooms available');
plt.xlabel('Room')
```

## :pushpin: Feature Engineering

```python
# Inspect the dataset at this point
gm_df.head()
```

```python
# Dataset for regression
gm_regression_df = gm_df.copy()
# Drop columns that are not useful for modelling
gm_regression_df = gm_regression_df.drop(columns=['id', 'scrape_id', 'last_scraped', 'name', 'host_id',
                                                  'host_since', 'first_review', 'last_review',
                                                  'price_clusters', 'host_name'])
# Inspect the data again
gm_regression_df.head()
```

```python
# Strip the percent sign and convert to a numeric type
gm_regression_df['host_response_rate'] = gm_regression_df['host_response_rate'].str.replace("%", "", regex=False)
gm_regression_df['host_acceptance_rate'] = gm_regression_df['host_acceptance_rate'].str.replace("%", "", regex=False)

# Convert to int
gm_regression_df['host_response_rate'] = pd.to_numeric(gm_regression_df['host_response_rate']).astype(int)
gm_regression_df['host_acceptance_rate'] = pd.to_numeric(gm_regression_df['host_acceptance_rate']).astype(int)
# Inspect the converted columns
gm_regression_df[['host_response_rate', 'host_acceptance_rate']].head()
```

The `bathrooms_text` column mixes numbers and text, so it needs some processing before it can be used as a feature.

```python
# Inspect the raw values
gm_regression_df['bathrooms_text'].value_counts()
```

```python
# Copy the rows whose text matches a keyword into a new column
def split_bathroom(df, column, text, new_column):
    mask = df[column].str.contains(text, case=False)
    df.loc[mask, new_column] = df.loc[mask, column]
    return df

# Apply the function for shared and private bathrooms
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='shared', new_column='shared_bath')
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='private', new_column='private_bath')
# Inspect shared_bath
gm_regression_df['shared_bath'].value_counts()
```

```python
# Inspect private_bath
gm_regression_df['private_bath'].value_counts()
```
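The goal of this step is to turn a string such as `1.5 shared baths` into a number plus a kind. As a compact sketch of that idea, and not the article's own method, the snippet below parses a few made-up values with regular expressions. The sample strings and the rule that a half-bath counts as 0.5 are assumptions.

```python
import pandas as pd

text = pd.Series([
    "1 bath", "1.5 shared baths", "2 private baths",
    "Shared half-bath", "Half-bath", "1 private bath",
])
lower = text.str.lower()

# Leading number if present; half-bath entries have none and count as 0.5
count = pd.to_numeric(lower.str.extract(r"^(\d+(?:\.\d+)?)")[0])
count = count.fillna(0.5)

# Kind of bathroom: shared, private, or plain when neither word appears
kind = lower.str.extract(r"(shared|private)")[0].fillna("plain")

parsed = pd.DataFrame({"text": text, "count": count, "kind": kind})
print(parsed)
```

From `parsed`, separate `shared_bath`, `private_bath` and plain bathroom columns could be built by filtering on `kind`.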

```python
# Mask the shared/private variants so that a plain 'bath' match finds only unqualified bathrooms
for pattern, token in [('private bath', 'pb'), ('shared bath', 'sb'),
                       ('shared half-bath', 'sb'), ('private half-bath', 'pb')]:
    gm_regression_df['bathrooms_text'] = gm_regression_df['bathrooms_text'].str.replace(
        pattern, token, case=False, regex=True)
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')

bath_columns = ['shared_bath', 'private_bath', 'bathrooms_new']
# Keep only the leading token: the count, or 'Shared' / 'Private' / 'Half-bath'
for col in bath_columns:
    gm_regression_df[col] = gm_regression_df[col].str.split(" ").str[0]

# Fill missing values with 0
gm_regression_df = gm_regression_df.fillna(0)
# Half-bath entries have no leading number; count them as 0.5
gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=0.5)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=0.5)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=0.5)

# Convert to numeric, kept as float so that 0.5 and 1.5 are not truncated to 0 and 1
for col in bath_columns:
    gm_regression_df[col] = pd.to_numeric(gm_regression_df[col])
# Inspect the processed columns
gm_regression_df[bath_columns].head()
```

```python
# Ordinal encoding
def encoder(df):
    for column in ['neighbourhood_group_cleansed', 'property_type']:
        labels = df[column].astype('category').cat.categories.tolist()
        # Map each category to 1..n in sorted order
        mapping = {k: v for k, v in zip(labels, range(1, len(labels) + 1))}
        df[column] = df[column].map(mapping)
        print({column: mapping})

    return df

gm_regression_df = encoder(gm_regression_df)
```
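A minimal sketch of this ordinal encoding on four made-up rows shows what the printed mapping looks like. The sample values are assumptions for illustration.

```python
import pandas as pd

towns = pd.Series(["Manchester", "Bolton", "Salford", "Bolton"])

# Categories come back sorted alphabetically and are numbered from 1
categories = towns.astype("category").cat.categories.tolist()
mapping = {name: code for code, name in enumerate(categories, start=1)}
encoded = towns.map(mapping)

print(mapping)
print(encoded.tolist())
```

The numbers follow alphabetical order only, so a linear model will read an ordering into them that the categories do not have; tree-based models are less sensitive to this.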

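```python
# One-hot encode the remaining categorical columns (dtype=int gives 0/1 instead of booleans)
host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response', dtype=int)
room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type', dtype=int)
# Append the encoded columns
gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)
# Drop the original columns
gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'])
```

```python
# Keep 150 amenity columns. The source does not say how they were chosen;
# taking the 150 most frequent amenities is an assumption made here.
amenity_counts = df_amenities.sum().sort_values(ascending=False)
features = amenity_counts.index[:150].to_list()
amenities_updated = df_amenities.filter(items=features)
gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)
```

```python
gm_regression_df.shape
# (3584, 198)
```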

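```python
# Compute the variance inflation factor (VIF) of every candidate feature.
# Only numeric columns are used: leftover text columns such as bathrooms_text cannot enter the computation.
vif_model = (gm_regression_df.drop(['price'], axis=1)
             .select_dtypes(include=['number', 'bool'])
             .astype(float))
vif_df = pd.DataFrame()
vif_df['feature'] = vif_model.columns
vif_df['VIF'] = [variance_inflation_factor(vif_model.values, i) for i in range(len(vif_model.columns))]
# Keep features with VIF <= 10
vif_df_new = vif_df[vif_df['VIF'] <= 10]
feature_list = vif_df_new['feature'].to_list()
# Select the data for those features
model_df = gm_regression_df.filter(items=feature_list)
```

```python
# Add the target column back
price_col = gm_regression_df['price']
model_df = model_df.join(price_col)
```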
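The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it by hand with NumPy on synthetic data. It adds an intercept to each auxiliary regression, whereas the statsmodels function uses the design matrix exactly as passed, so the numbers are illustrative and need not match statsmodels output.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # intercept plus the other features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)                   # independent of a
c = a + 0.05 * rng.normal(size=200)        # nearly a copy of a
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

Columns `a` and `c` explain each other almost perfectly and get very large VIFs, while `b` stays close to 1. A cutoff of 10, as used above, would remove both `a` and `c`.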

## :pushpin: Machine Learning Models

### Linear Regression

```python
def linear_reg(df, test_size=0.3, random_state=42):
    '''
    Build the model and return its evaluation results.
    Input: DataFrame
    Output: feature importance and metrics (RMSE and R-squared)
    '''

    X = df.drop(columns=['price'])
    y = df[['price']]
    X_columns = X.columns

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Linear regression model
    clf = LinearRegression()

    # Candidate parameters. n_jobs only controls parallelism and does not change the fit,
    # so it is left out of the search.
    parameters = {
        'fit_intercept': [True, False]
    }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=3, verbose=3)
    cv.fit(X_train, y_train)

    # Predict on the test set
    pred = cv.predict(X_test)

    # Evaluate
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5

    # Best parameters and fitted coefficients
    best_par = cv.best_params_
    coefficients = cv.best_estimator_.coef_

    # Feature importance: absolute size of each coefficient
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, columns=X_columns).T
    feature_importance.columns = ['importance']
    feature_importance = feature_importance.sort_values('importance', ascending=False)

    print("The model performance for testing set")
    print("--------------------------------------")
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))
    print("\n")

    return feature_importance, rmse, r2

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(model_df)
```
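Both metrics returned here can be computed by hand. The sketch below does so on four made-up price predictions, to show what `mean_squared_error` and `r2_score` measure; the numbers are hypothetical.

```python
import numpy as np

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the same unit as the price
mse = np.mean(residuals ** 2)
rmse = mse ** 0.5

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

Every prediction is off by 10, so the RMSE is 10, and R-squared is 1 - 400 / 12500 = 0.968.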

### Random Forest

```python
# Random forest model
def random_forest(df):
    '''
    Build the model and return its evaluation results.
    Input: DataFrame
    Output: metrics (RMSE and R-squared), best parameters and feature importance
    '''

    X = df.drop(['price'], axis=1)
    X_columns = X.columns

    y = df['price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Random forest regressor
    clf = RandomForestRegressor()

    # Candidate parameters. The original dict listed max_depth twice; Python keeps only
    # the last entry, which matches the 15 candidates in the log below.
    parameters = {
        'n_estimators': [50, 100, 200, 300, 400],
        'max_depth': [80, 90, 100]
    }

    # Grid search with cross-validation
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)

    # Predict on the test set
    pred = model.predict(X_test)

    # Evaluate
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5
    r2 = r2_score(y_test, pred)

    # Best hyperparameters
    best_par = model.best_params_

    # Feature importance via permutation on the test set
    r = permutation_importance(model, X_test, y_test,
                               n_repeats=10,
                               random_state=0)
    perm = pd.DataFrame(columns=['AVG_Importance'], index=X_train.columns)
    perm['AVG_Importance'] = r.importances_mean
    perm = perm.sort_values(by='AVG_Importance', ascending=False)

    return rmse, r2, best_par, perm

# Run the model
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(model_df)
```

```
Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..................max_depth=80, n_estimators=50; total time=   2.4s
[CV 2/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 4/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 5/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 1/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 2/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 3/5] END .................max_depth=80, n_estimators=100; total time=   3.9s
[CV 4/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 5/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 1/5] END .................max_depth=80, n_estimators=200; total time=   7.5s
[CV 2/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 3/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 4/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 5/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 1/5] END .................max_depth=80, n_estimators=300; total time=  11.3s
[CV 2/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 3/5] END .................max_depth=80, n_estimators=300; total time=  11.7s
[CV 4/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 5/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 1/5] END .................max_depth=80, n_estimators=400; total time=  15.1s
[CV 2/5] END .................max_depth=80, n_estimators=400; total time=  16.4s
[CV 3/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 4/5] END .................max_depth=80, n_estimators=400; total time=  15.2s
[CV 5/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 1/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 2/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 4/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 5/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 1/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 2/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 3/5] END .................max_depth=90, n_estimators=100; total time=   4.0s
[CV 4/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 5/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 1/5] END .................max_depth=90, n_estimators=200; total time=   8.7s
[CV 2/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 3/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 4/5] END .................max_depth=90, n_estimators=200; total time=   7.7s
[CV 5/5] END .................max_depth=90, n_estimators=200; total time=   8.0s
[CV 1/5] END .................max_depth=90, n_estimators=300; total time=  11.6s
[CV 2/5] END .................max_depth=90, n_estimators=300; total time=  11.8s
[CV 3/5] END .................max_depth=90, n_estimators=300; total time=  12.2s
[CV 4/5] END .................max_depth=90, n_estimators=300; total time=  12.0s
[CV 5/5] END .................max_depth=90, n_estimators=300; total time=  13.2s
[CV 1/5] END .................max_depth=90, n_estimators=400; total time=  15.6s
[CV 2/5] END .................max_depth=90, n_estimators=400; total time=  15.9s
[CV 3/5] END .................max_depth=90, n_estimators=400; total time=  16.1s
[CV 4/5] END .................max_depth=90, n_estimators=400; total time=  15.7s
[CV 5/5] END .................max_depth=90, n_estimators=400; total time=  15.8s
[CV 1/5] END .................max_depth=100, n_estimators=50; total time=   1.9s
[CV 2/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 3/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 4/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 5/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 1/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 2/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 3/5] END ................max_depth=100, n_estimators=100; total time=   4.1s
[CV 4/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 5/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 1/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 2/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 3/5] END ................max_depth=100, n_estimators=200; total time=   8.1s
[CV 4/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 5/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 1/5] END ................max_depth=100, n_estimators=300; total time=  11.8s
[CV 2/5] END ................max_depth=100, n_estimators=300; total time=  12.0s
[CV 3/5] END ................max_depth=100, n_estimators=300; total time=  12.8s
[CV 4/5] END ................max_depth=100, n_estimators=300; total time=  11.4s
[CV 5/5] END ................max_depth=100, n_estimators=300; total time=  11.5s
[CV 1/5] END ................max_depth=100, n_estimators=400; total time=  15.1s
[CV 2/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 3/5] END ................max_depth=100, n_estimators=400; total time=  15.6s
[CV 4/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 5/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
```

```python
r_forest_rmse, r_forest_r2
# (218.7941962807868, 0.4208644494689676)
```
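Permutation importance, used above for the random forest, measures how much the score drops when one feature's values are shuffled. The sketch below reimplements the idea with NumPy on synthetic data, using a least-squares fit as a stand-in model. The data, the stand-in model and scoring on the training data are simplifications assumed for this example; the article scores a tuned random forest on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Column 0 matters most, column 1 a little, column 2 is pure noise
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Stand-in model: ordinary least squares with an intercept
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(X_new):
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def r2(y_true, y_hat):
    return 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

baseline = r2(y, predict(X))
importances = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(10):                      # 10 repeats, as with n_repeats=10
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between feature j and y
        drops.append(baseline - r2(y, predict(X_perm)))
    importances.append(float(np.mean(drops)))

print(importances)
```

The mean drop in R² is largest for column 0, small for column 1 and close to zero for the noise column, which is the ranking `importances_mean` is meant to expose.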

### GBDT

```python
from sklearn.ensemble import GradientBoostingRegressor

def GBDT_model(df):
    '''
    Build the model and return its evaluation results.
    Input: DataFrame
    Output: metrics (R-squared, MSE, RMSE) and feature importance
    '''

    X = df.drop(['price'], axis=1)
    Y = df['price']
    X_columns = X.columns
    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

    # The source never defines clf; scikit-learn's GradientBoostingRegressor is assumed here
    # because it accepts both parameters in the grid below
    clf = GradientBoostingRegressor()

    parameters = {
        'learning_rate': [0.1, 0.5, 1],
        'min_samples_leaf': [10, 20, 40, 60]
    }
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)

    model = cv
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse ** .5

    # Impurity-based importances of the best model, top 10
    importance = np.abs(model.best_estimator_.feature_importances_)
    feature_importance = pd.DataFrame(importance, index=X_columns,
                                      columns=['importance']).sort_values('importance', ascending=False)[:10]

    return r2, mse, rmse, feature_importance

GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(model_df)
GBDT_r2, GBDT_rmse
# (0.46352992147034244, 210.58063809645563)
```

## :pushpin: Improving the Results

```python
# Compute an upper price boundary from the interquartile range (IQR)
q3, q1 = np.percentile(model_df['price'], [75, 25])
iqr = q3 - q1
q3 + (iqr * 1.5)
# Result: 245.0
```
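The boundary above is the upper Tukey fence, Q3 + 1.5 × IQR. A small sketch on ten made-up prices shows how the fence is derived and which values it removes; the prices are hypothetical.

```python
import numpy as np

prices = np.array([40, 55, 60, 79, 85, 100, 120, 150, 900, 7372])

q1, q3 = np.percentile(prices, [25, 75])   # 64.75 and 142.5 with linear interpolation
iqr = q3 - q1                               # 77.75
upper_fence = q3 + 1.5 * iqr                # 259.125

# Keep only listings priced below the fence
kept = prices[prices < upper_fence]
print(upper_fence, kept)
```

The two extreme listings (900 and 7372) fall above the fence and are dropped, while the bulk of the prices is untouched.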

```python
# Keep only listings below the boundary
new_model_df = model_df[model_df['price'] < 245]
# Plot the new price distribution
sb.histplot(new_model_df['price'])
plt.title('New price distribution in the dataset')
```

```python
# Retrain all three models on the filtered data
linear_feat_importance, linear_rmse, linear_r2 = linear_reg(new_model_df)
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(new_model_df)
GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(new_model_df)
```

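## :bulb: Feature Attribution

```python
r_feature_importance = r_forest_importance.reset_index()
r_feature_importance = r_feature_importance.rename(columns={'index': 'Feature'})
r_feature_importance[:15]
```

```python
# Plot the 15 most important features
r_feature_importance[:15].sort_values(by='AVG_Importance').plot(kind='barh', x='Feature', y='AVG_Importance', figsize=(8, 6));
plt.title('Top 15 Most Important Features');
```

`accommodates`: the maximum number of guests the listing can host.
`bathrooms_new`: the number of bathrooms not labelled as shared or private.
`minimum_nights`: the minimum number of nights that can be booked.
`number_of_reviews`: the total number of reviews.
`Free street parking`: the availability of free on-street parking is the amenity with the largest influence on the model's price predictions.
`Gym`: whether the listing offers gym facilities.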

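## References

:blue_book: Data Science Cheat Sheets | Pandas Cheat Sheet: www.showmeai.tech/article-det…
:blue_book: Illustrated Data Analysis: From Beginner to Expert (tutorial series): www.showmeai.tech/tutorials/3…
:blue_book: Machine Learning in Practice: Hands-On Machine Learning (tutorial series): www.showmeai.tech/tutorials/4…
:blue_book: Machine Learning in Practice | Getting Started with SKLearn and Simple Use Cases: www.showmeai.tech/article-det…
:blue_book: Machine Learning in Practice | A Complete Guide to SKLearn: www.showmeai.tech/article-det…
:blue_book: Machine Learning in Practice | A Complete Guide to Feature Engineering: www.showmeai.tech/article-det…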