# importing libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
2 Multiple Linear Regression
Read section 3.2 of the book before using these notes.
Note that in this course the lecture notes are not sufficient on their own; you must read the book for a proper understanding. The lecture notes only apply the concepts of the book to a dataset; they do not explain the concepts in detail.
2.1 Multiple Linear Regression
Develop a multiple linear regression model that predicts car price based on engine size, year, mileage, and mpg. Datasets to be used: Car_features_train.csv, Car_prices_train.csv
# Reading datasets
trainf = pd.read_csv('./Datasets/Car_features_train.csv')
trainp = pd.read_csv('./Datasets/Car_prices_train.csv')
train = pd.merge(trainf, trainp)
train.head()
| | carID | brand | model | year | transmission | mileage | fuelType | tax | mpg | engineSize | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18473 | bmw | 6 Series | 2020 | Semi-Auto | 11 | Diesel | 145 | 53.3282 | 3.0 | 37980 |
| 1 | 15064 | bmw | 6 Series | 2019 | Semi-Auto | 10813 | Diesel | 145 | 53.0430 | 3.0 | 33980 |
| 2 | 18268 | bmw | 6 Series | 2020 | Semi-Auto | 6 | Diesel | 145 | 53.4379 | 3.0 | 36850 |
| 3 | 18480 | bmw | 6 Series | 2017 | Semi-Auto | 18895 | Diesel | 145 | 51.5140 | 3.0 | 25998 |
| 4 | 18492 | bmw | 6 Series | 2015 | Automatic | 62953 | Diesel | 160 | 51.4903 | 3.0 | 18990 |
2.1.1 Training the model
# Using the ols function to create an ols object. 'ols' stands for 'Ordinary Least Squares'
ols_object = smf.ols(formula = 'price~year+mileage+mpg+engineSize', data = train)
model = ols_object.fit()
model.summary()
| Dep. Variable: | price | R-squared: | 0.660 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.660 |
| Method: | Least Squares | F-statistic: | 2410. |
| Date: | Mon, 29 Jan 2024 | Prob (F-statistic): | 0.00 |
| Time: | 03:10:20 | Log-Likelihood: | -52497. |
| No. Observations: | 4960 | AIC: | 1.050e+05 |
| Df Residuals: | 4955 | BIC: | 1.050e+05 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.661e+06 | 1.49e+05 | -24.593 | 0.000 | -3.95e+06 | -3.37e+06 |
| year | 1817.7366 | 73.751 | 24.647 | 0.000 | 1673.151 | 1962.322 |
| mileage | -0.1474 | 0.009 | -16.817 | 0.000 | -0.165 | -0.130 |
| mpg | -79.3126 | 9.338 | -8.493 | 0.000 | -97.620 | -61.006 |
| engineSize | 1.218e+04 | 189.969 | 64.107 | 0.000 | 1.18e+04 | 1.26e+04 |

| Omnibus: | 2450.973 | Durbin-Watson: | 0.541 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 31060.548 |
| Skew: | 2.045 | Prob(JB): | 0.00 |
| Kurtosis: | 14.557 | Cond. No. | 3.83e+07 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.83e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
The model equation is:

\[
\widehat{\text{price}} = -3.661\times 10^{6} + 1818 \times \text{year} - 0.15 \times \text{mileage} - 79.31 \times \text{mpg} + 12180 \times \text{engineSize}
\]
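As a sanity check, the fitted coefficients can be used to reproduce a prediction by hand. The sketch below (assuming the `model` and `train` objects defined above) plugs the first training row into the equation and compares the result with `model.predict`:

```python
# Reproduce the prediction for the first training row from the fitted coefficients
b = model.params                      # Series indexed by Intercept, year, mileage, mpg, engineSize
row = train.iloc[0]
manual_pred = (b['Intercept'] + b['year']*row['year'] + b['mileage']*row['mileage']
               + b['mpg']*row['mpg'] + b['engineSize']*row['engineSize'])
print(manual_pred, model.predict(train.iloc[[0]]).iloc[0])   # the two values should match
```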
The procedure to fit the model using sklearn
will be similar to that in simple linear regression.
model_sklearn = LinearRegression()   # a separate name keeps the statsmodels results in 'model' for later use

X_train = train[['year','engineSize','mpg','mileage']] # Slice out the predictors
y_train = train[['price']]

model_sklearn.fit(X_train, y_train)
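To confirm that sklearn reproduces the statsmodels fit, the intercept and coefficients can be printed and compared with the summary table above (a minimal sketch assuming the `model_sklearn`, `X_train`, and `y_train` objects from the previous cell):

```python
# The sklearn estimates should match the statsmodels summary table
print(model_sklearn.intercept_)                             # about -3.661e+06
print(dict(zip(X_train.columns, model_sklearn.coef_[0])))   # year, engineSize, mpg, mileage
```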
2.1.2 Hypothesis test for a relationship between the response and a subset of predictors
Let us test whether there is a relationship between car price and the set of predictors mpg and year.
hypothesis = '(mpg = 0, year = 0)'
# The F-test of these two predictors is statistically significant
model.f_test(hypothesis)
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=325.9206432972666, p=1.0499509223096256e-133, df_denom=4.96e+03, df_num=2>
As the \(p\)-value is low, we reject the null hypothesis, i.e., at least one of the predictors among mpg and year has a statistically significant relationship with car price.
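The same F-statistic can be computed by hand from the residual sums of squares of the full model and of a reduced model that drops mpg and year, using \(F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/(n-p-1)}\), where \(q = 2\) is the number of restrictions. A minimal sketch, assuming the train data frame and the fitted model from above:

```python
# Partial F-test by hand: compare the full model with a reduced model without mpg and year
reduced = smf.ols(formula = 'price~mileage+engineSize', data = train).fit()
q = 2                                                   # number of restrictions: mpg = 0, year = 0
F = ((reduced.ssr - model.ssr)/q) / (model.ssr/model.df_resid)
print(F)                                                # should match the F-statistic from model.f_test
```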
Predict the car price for the cars in the test dataset. Datasets to be used: Car_features_test.csv, Car_prices_test.csv
testf = pd.read_csv('./Datasets/Car_features_test.csv')
testp = pd.read_csv('./Datasets/Car_prices_test.csv')
2.1.3 Prediction
pred_price = model.predict(testf)
Make a visualization that compares the predicted car prices with the actual car prices.
sns.set(font_scale=1.25)
sns.scatterplot(x = testp.price, y = pred_price, color = 'orange')
# In case of a perfect prediction, all the points must lie on the line x = y.
ax = sns.lineplot(x = [0,testp.price.max()], y = [0,testp.price.max()], color='blue') # Plotting the line x = y
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.ylim([-10000, 160000])
ax.yaxis.set_major_formatter('${x:,.0f}')
ax.xaxis.set_major_formatter('${x:,.0f}')
plt.xticks(rotation=20);
The predictions look better than those from simple linear regression. This is because we now have four predictors to help explain the variation in car price, instead of just one. Also, all the predictors have a statistically significant relationship with price, as evident from their \(p\)-values, so all four of them contribute to explaining the variation. Note the higher value of \(R^2\) compared to the simple linear regression model.
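To quantify this, the \(R^2\) of the four-predictor model can be compared with that of a single-predictor model. A minimal sketch, where the choice of engineSize as the lone predictor is only for illustration and need not match the simple linear regression model used earlier:

```python
# Compare the R-squared of a single-predictor model with the four-predictor model
simple = smf.ols(formula = 'price~engineSize', data = train).fit()
print(simple.rsquared, model.rsquared)   # the four-predictor model explains more of the variation
```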
What is the RMSE of the predicted car price?
np.sqrt(((testp.price - pred_price)**2).mean())
9956.82497993548
What is the residual standard error based on the training data?
np.sqrt(model.mse_resid)
9563.74782917604
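The residual standard error is defined as \(\sqrt{RSS/(n-p-1)}\), and `model.mse_resid` is exactly \(RSS/(n-p-1)\). A minimal sketch verifying this from the fitted model's attributes:

```python
# Residual standard error from its definition: sqrt(RSS / (n - p - 1))
rss = model.ssr                        # residual sum of squares
rse = np.sqrt(rss / model.df_resid)    # df_resid = n - p - 1 = 4955
print(rse)                             # same value as np.sqrt(model.mse_resid)
```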
To put the RMSE and residual standard error in context, examine the scale of car prices in the training data.

trainp.describe()
| | carID | price |
|---|---|---|
| count | 4960.000000 | 4960.000000 |
| mean | 15832.446169 | 23469.943750 |
| std | 2206.717006 | 16406.714563 |
| min | 12002.000000 | 450.000000 |
| 25% | 13929.250000 | 12000.000000 |
| 50% | 15840.000000 | 18999.000000 |
| 75% | 17765.750000 | 30335.750000 |
| max | 19629.000000 | 145000.000000 |
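One way to judge whether the test RMSE of roughly \$9,957 is large is to compare it with the scale of the prices themselves. A minimal sketch, assuming the objects defined above:

```python
# Put the test RMSE in perspective relative to the price distribution
rmse = np.sqrt(((testp.price - pred_price)**2).mean())
print(rmse / trainp.price.mean())   # RMSE as a fraction of the average car price
print(rmse / trainp.price.std())    # RMSE as a fraction of the price standard deviation
```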
# Residual plot: residuals vs. fitted values on the training data
sns.scatterplot(x = model.fittedvalues, y = model.resid, color = 'orange')
ax = sns.lineplot(x = [pred_price.min(),pred_price.max()], y = [0,0], color = 'blue')
plt.xlabel('Predicted price')
plt.ylabel('Residual')
ax.yaxis.set_major_formatter('${x:,.0f}')
ax.xaxis.set_major_formatter('${x:,.0f}')
plt.xticks(rotation=20);
2.1.4 Effect of adding noisy predictors on \(R^2\)
Will the explained variation (R-squared) in car price always increase if we add a variable?
Should we keep on adding variables as long as the explained variation (R-squared) is increasing?
# Add a column of random values (noise) as a predictor and refit the OLS model
np.random.seed(1)
train['rand_col'] = np.random.rand(train.shape[0])
ols_object = smf.ols(formula = 'price~year+mileage+mpg+engineSize+rand_col', data = train)
model = ols_object.fit()
model.summary()
| Dep. Variable: | price | R-squared: | 0.661 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.660 |
| Method: | Least Squares | F-statistic: | 1928. |
| Date: | Tue, 27 Dec 2022 | Prob (F-statistic): | 0.00 |
| Time: | 01:07:38 | Log-Likelihood: | -52497. |
| No. Observations: | 4960 | AIC: | 1.050e+05 |
| Df Residuals: | 4954 | BIC: | 1.050e+05 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3.662e+06 | 1.49e+05 | -24.600 | 0.000 | -3.95e+06 | -3.37e+06 |
| year | 1818.1672 | 73.753 | 24.652 | 0.000 | 1673.578 | 1962.756 |
| mileage | -0.1474 | 0.009 | -16.809 | 0.000 | -0.165 | -0.130 |
| mpg | -79.2837 | 9.338 | -8.490 | 0.000 | -97.591 | -60.976 |
| engineSize | 1.218e+04 | 189.972 | 64.109 | 0.000 | 1.18e+04 | 1.26e+04 |
| rand_col | 451.1226 | 471.897 | 0.956 | 0.339 | -474.004 | 1376.249 |

| Omnibus: | 2451.728 | Durbin-Watson: | 0.541 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 31040.331 |
| Skew: | 2.046 | Prob(JB): | 0.00 |
| Kurtosis: | 14.552 | Cond. No. | 3.83e+07 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.83e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
Adding a variable with random values (rand_col) to the model increased the explained variation (\(R^2\)). This is because the model has one more parameter to tune to reduce the residual sum of squares (RSS). However, the \(p\)-value of rand_col suggests that its coefficient is not significantly different from zero. Thus, the model with rand_col may perform worse on unseen data than the model without it. This implies that it is not a good idea to blindly add variables to the model just to increase \(R^2\).
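Adjusted \(R^2\), which penalizes additional parameters, is one guard against this. A minimal sketch, assuming the model without rand_col is refit under a different name so the two can be compared:

```python
# Compare R-squared and adjusted R-squared with and without the noise column
model_no_noise = smf.ols(formula = 'price~year+mileage+mpg+engineSize', data = train).fit()
print(model_no_noise.rsquared, model_no_noise.rsquared_adj)
print(model.rsquared, model.rsquared_adj)   # R-squared rises slightly; adjusted R-squared does not reward the noise
```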