## Assumptions of Linear Regression

1. Linearity (the relationship between the dependent and independent variables is linear)

1. Homoscedasticity (the residuals have constant variance)

1. Multivariate normality (the residuals follow a multivariate normal distribution)

1. Independence of errors (the errors are independent of one another)

1. Lack of multicollinearity (no independent variable is a linear combination of the others)
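The last assumption can be checked numerically with variance inflation factors (VIFs). A minimal sketch using statsmodels; the synthetic `x1`/`x2`/`x3` columns here are made up purely to illustrate what a collinear predictor looks like:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy design matrix: x3 is almost exactly x1 + x2, so it is collinear
# with the other predictors (rule of thumb: VIF > 10 flags a problem).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)  # nearly collinear
X = np.column_stack([np.ones(100), x1, x2, x3])  # intercept column first

# VIF for each predictor (skip the intercept column at index 0)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)
```

A VIF far above 10 for `x3` (and, symmetrically, for `x1` and `x2`) signals that the lack-of-multicollinearity assumption is violated.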

| R&D Spend | Administration | Marketing Spend | State | Profit |
| --- | --- | --- | --- | --- |
| 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |

| New York | California | Florida |
| --- | --- | --- |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |
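The dummy-variable encoding shown above can be produced directly with pandas; a small sketch (the mini `DataFrame` here is a made-up stand-in for the State column of the real dataset):

```python
import pandas as pd

# Hypothetical mini-frame mirroring the State column above
df = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# One-hot encode; drop_first=True removes one column so the remaining
# dummies avoid the dummy variable trap (the dropped category "California"
# becomes the baseline)
dummies = pd.get_dummies(df["State"], drop_first=True)
print(dummies)
```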

$$
y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 D_1
$$

where $D_1$ is a dummy variable for the State column (one dummy column is dropped to avoid the dummy variable trap).

## How to Build a Multiple Linear Regression Model

1. All-in

1. Backward Elimination

1. Forward Selection

1. Bidirectional Elimination

1. Score Comparison (compare candidate models by an information criterion)

### All-in

1. Prior knowledge: you already know in advance that every one of these independent variables affects the model's output

1. You have to: a superior requires you to use all of these variables

1. Preparing for Backward Elimination

All-in is the simplest method, but it is normally used only in special circumstances or under external constraints; it is not recommended in general.

## Coding

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# Load the dataset
data_path = '../data/50_Startups.csv'
dataset = pd.read_csv(data_path)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# One-hot encode the categorical State column (index 3)
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)

# Avoiding the Dummy Variable Trap
X = X[:, 1:]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

```python
# Fitting Multiple Linear Regression to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
```

statsmodels' `OLS` does not add an intercept automatically, so prepend a column of ones:

`X_train = np.append(arr=np.ones((X_train.shape[0], 1)), values=X_train, axis=1)`

`X_opt = X_train[:, [0, 1, 2, 3, 4, 5]]`

```python
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
regressor_OLS.summary()
```

```python
# Each round: drop the predictor with the highest p-value, refit, and
# inspect the summary again, until every remaining p-value is below the
# chosen significance level (here 0.05)
X_opt = X_train[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
regressor_OLS.summary()

X_opt = X_train[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
regressor_OLS.summary()

X_opt = X_train[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
regressor_OLS.summary()

X_opt = X_train[:, [0, 3]]
regressor_OLS = sm.OLS(endog=y_train, exog=X_opt).fit()
regressor_OLS.summary()
```