import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
3 Variable interactions and transformations
Read sections 3.3.1 and 3.3.2 of the book before using these notes.
Note that in this course the lecture notes are not sufficient on their own; you must read the book for a thorough understanding. The lecture notes merely implement the concepts of the book on a dataset; they do not explain the concepts in detail.
3.1 Variable interactions
trainf = pd.read_csv('./Datasets/Car_features_train.csv')
trainp = pd.read_csv('./Datasets/Car_prices_train.csv')
testf = pd.read_csv('./Datasets/Car_features_test.csv')
testp = pd.read_csv('./Datasets/Car_prices_test.csv')
train = pd.merge(trainf,trainp)
test = pd.merge(testf,testp)
train.head()
| | carID | brand | model | year | transmission | mileage | fuelType | tax | mpg | engineSize | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18473 | bmw | 6 Series | 2020 | Semi-Auto | 11 | Diesel | 145 | 53.3282 | 3.0 | 37980 |
| 1 | 15064 | bmw | 6 Series | 2019 | Semi-Auto | 10813 | Diesel | 145 | 53.0430 | 3.0 | 33980 |
| 2 | 18268 | bmw | 6 Series | 2020 | Semi-Auto | 6 | Diesel | 145 | 53.4379 | 3.0 | 36850 |
| 3 | 18480 | bmw | 6 Series | 2017 | Semi-Auto | 18895 | Diesel | 145 | 51.5140 | 3.0 | 25998 |
| 4 | 18492 | bmw | 6 Series | 2015 | Automatic | 62953 | Diesel | 160 | 51.4903 | 3.0 | 18990 |
Until now, we have assumed that the association between a predictor \(X_j\) and response \(Y\) does not depend on the value of other predictors. For example, the multiple linear regression model that we developed in Chapter 2 assumes that the average increase in price associated with a unit increase in engineSize is always $12,180, regardless of the value of other predictors. However, this assumption may be incorrect.
3.1.1 Variable interaction between continuous predictors
We can relax this assumption by considering another predictor, called an interaction term. Let us assume that the average increase in price
associated with a one-unit increase in engineSize
depends on the model year
of the car. In other words, there is an interaction between engineSize
and year
. This interaction can be included as a predictor, which is the product of engineSize
and year
. Note that there are several possible interactions that we can consider. Here the interaction between engineSize
and year
is just an example.
#Considering interaction between engineSize and year
ols_object = smf.ols(formula = 'price~year*engineSize+mileage+mpg', data = train)
model = ols_object.fit()
model.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.682 |
| Model: | OLS | Adj. R-squared: | 0.681 |
| Method: | Least Squares | F-statistic: | 2121. |
| Date: | Tue, 24 Jan 2023 | Prob (F-statistic): | 0.00 |
| Time: | 15:28:11 | Log-Likelihood: | -52338. |
| No. Observations: | 4960 | AIC: | 1.047e+05 |
| Df Residuals: | 4954 | BIC: | 1.047e+05 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 5.606e+05 | 2.74e+05 | 2.048 | 0.041 | 2.4e+04 | 1.1e+06 |
| year | -275.3833 | 135.695 | -2.029 | 0.042 | -541.405 | -9.361 |
| engineSize | -1.796e+06 | 9.97e+04 | -18.019 | 0.000 | -1.99e+06 | -1.6e+06 |
| year:engineSize | 896.7687 | 49.431 | 18.142 | 0.000 | 799.861 | 993.676 |
| mileage | -0.1525 | 0.008 | -17.954 | 0.000 | -0.169 | -0.136 |
| mpg | -84.3417 | 9.048 | -9.322 | 0.000 | -102.079 | -66.604 |

| | | | |
|---|---|---|---|
| Omnibus: | 2330.413 | Durbin-Watson: | 0.524 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 29977.437 |
| Skew: | 1.908 | Prob(JB): | 0.00 |
| Kurtosis: | 14.423 | Cond. No. | 7.66e+07 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.66e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
Note that the R-squared has increased as compared to the model in Chapter 2 since we added a predictor.
The model equation is:
\[\begin{equation} price = \beta_0 + \beta_1*year + \beta_2*engineSize + \beta_3*(year * engineSize) + \beta_4*mileage + \beta_5*mpg, \end{equation}\]
or
\[\begin{equation} price = \beta_0 + \beta_1*year + (\beta_2+\beta_3*year)*engineSize + \beta_4*mileage + \beta_5*mpg, \end{equation}\]
or
\[\begin{equation} price = \beta_0 + \beta_1*year + \tilde \beta*engineSize + \beta_4*mileage + \beta_5*mpg, \end{equation}\]
where \(\tilde \beta = \beta_2 + \beta_3*year\).
Since \(\tilde \beta\) is a function of year, the association between engineSize and price is no longer a constant. A change in the value of year will change the association between price and engineSize.
Substituting the values of the coefficients:
price = 5.606e5 - 275.3833 * year + (-1.796e6 + 896.7687 * year) * engineSize - 0.1525 * mileage - 84.3417 * mpg
Thus, for cars launched in the year 2010, the average increase in price for a one-litre increase in engine size is -1.796e6 + 896.7687 * 2010 \(\approx\) \$6,500, assuming all the other predictors are constant. However, for cars launched in the year 2020, the average increase in price for a one-litre increase in engine size is -1.796e6 + 896.7687 * 2020 \(\approx\) \$15,500, assuming all the other predictors are constant.
Similarly, the equation can be re-arranged as:
price = 5.606e5 + (-275.3833 + 896.7687 * engineSize) * year - 1.796e6 * engineSize - 0.1525 * mileage - 84.3417 * mpg
Thus, for cars with an engine size of 2 litres, the average increase in price for a one year newer model is -275.3833+896.7687 * 2 \(\approx\) \$1500, assuming all the other predictors are constant. However, for cars with an engine size of 3 litres, the average increase in price for a one year newer model is -275.3833+896.7687 * 3 \(\approx\) \$2400, assuming all the other predictors are constant.
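As a sanity check, we can compute these slopes directly from the fitted coefficients; a minimal sketch using the model object fitted above:

#Slope of engineSize at a given year, and slope of year at a given engine size
b = model.params
for yr in [2010, 2020]:
    print('Year', yr, ': engineSize slope =', round(b['engineSize'] + b['year:engineSize']*yr))
for es in [2, 3]:
    print('Engine size', es, ': year slope =', round(b['year'] + b['year:engineSize']*es))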
#Computing the RMSE of the model with the interaction term
pred_price = model.predict(testf)
np.sqrt(((testp.price - pred_price)**2).mean())
9423.598872501092
Note that the RMSE is lower than that of the model in Chapter 2. This is because the interaction term between engineSize and year is significant and relaxes the assumption of a constant association between price and engine size, and between price and year. This added flexibility makes the model fit the data better. Caution: too much flexibility may lead to overfitting!
Note that interaction terms corresponding to other variable pairs, and higher-order interaction terms (such as those containing 3 or 4 variables), may also be significant and improve the model fit, and thereby the prediction accuracy of the model. For instance, a three-factor interaction can be specified by chaining * in the formula, as shown in the sketch below.
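This is only an illustrative sketch (the choice of variables is arbitrary), not a claim that this particular model is better:

#Illustration: 'year*engineSize*mileage' expands to all main effects and all two- and three-factor interactions
ols_3way = smf.ols(formula = 'price~year*engineSize*mileage+mpg', data = train)
model_3way = ols_3way.fit()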
3.1.2 Including qualitative predictors in the model
Let us develop a model for predicting price based on engineSize and the qualitative predictor transmission.
#checking the distribution of values of transmission
train.transmission.value_counts()
Manual 1948
Automatic 1660
Semi-Auto 1351
Other 1
Name: transmission, dtype: int64
Note that the Other category of the variable transmission contains only a single observation, which is likely to be insufficient to train the model. We'll remove that observation from the training data. Another option would be to combine the observation in the Other category with the nearest category, and keep it in the data (see the sketch after the next cell).
train_updated = train[train.transmission!='Other']
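A hypothetical sketch of the merge option mentioned above; the choice of Semi-Auto as the "nearest" category here is only an assumption for illustration:

#Alternative (not run): merge the single 'Other' observation into a nearby category
#train['transmission'] = train.transmission.replace('Other', 'Semi-Auto')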
ols_object = smf.ols(formula = 'price ~ engineSize + transmission', data = train_updated)
model = ols_object.fit()
model.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.459 |
| Model: | OLS | Adj. R-squared: | 0.458 |
| Method: | Least Squares | F-statistic: | 1400. |
| Date: | Tue, 24 Jan 2023 | Prob (F-statistic): | 0.00 |
| Time: | 15:28:21 | Log-Likelihood: | -53644. |
| No. Observations: | 4959 | AIC: | 1.073e+05 |
| Df Residuals: | 4955 | BIC: | 1.073e+05 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3042.6765 | 661.190 | 4.602 | 0.000 | 1746.451 | 4338.902 |
| transmission[T.Manual] | -6770.6165 | 442.116 | -15.314 | 0.000 | -7637.360 | -5903.873 |
| transmission[T.Semi-Auto] | 4994.3112 | 442.989 | 11.274 | 0.000 | 4125.857 | 5862.765 |
| engineSize | 1.023e+04 | 247.485 | 41.323 | 0.000 | 9741.581 | 1.07e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 1575.518 | Durbin-Watson: | 0.579 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 11006.609 |
| Skew: | 1.334 | Prob(JB): | 0.00 |
| Kurtosis: | 9.793 | Cond. No. | 11.4 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Note that there is no coefficient for the Automatic level of the variable transmission. If a car doesn't have a Manual or Semi-Auto transmission, then it has an Automatic transmission, so a separate coefficient for Automatic would be redundant, and the dummy variable corresponding to Automatic transmission is dropped from the model.

The level of the categorical variable that is dropped from the model is called the baseline level. Here, Automatic transmission is the baseline level. The coefficients of the other levels of transmission should be interpreted with respect to the baseline level.
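One way to inspect the dummy encoding is with pandas; this is only a sketch of what the formula interface (via patsy) does internally when fitting the model:

#Viewing the dummy variables for transmission; the first level in alphabetical order (Automatic) is dropped
pd.get_dummies(train_updated.transmission, drop_first = True).head()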
Q: Interpret the intercept term
Ans: For the hypothetical scenario of a car with zero engine size and Automatic transmission, the estimated mean car price is \(\approx\) \$3042.
Q: Interpret the coefficient of transmission[T.Manual]
Ans: The estimated mean price of a car with manual transmission is \(\approx\) \$6770 less than that of a car with Automatic transmission.
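We can see this interpretation in action by predicting prices for hypothetical cars that differ only in transmission; the 2-litre engine size is an arbitrary choice for illustration:

#Predicted prices for hypothetical cars with the same engine size but different transmissions
new_cars = pd.DataFrame({'engineSize':[2.0, 2.0, 2.0], 'transmission':['Automatic', 'Manual', 'Semi-Auto']})
model.predict(new_cars)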
Let us visualize the developed model.
#Visualizing the developed model
"figure.figsize"] = (9,6)
plt.rcParams[set(font_scale = 1.3)
sns.= np.linspace(train_updated.engineSize.min(),train_updated.engineSize.max(),100)
x = sns.lineplot(x = x, y = model.params['engineSize']*x+model.params['Intercept'], color = 'red')
ax = x, y = model.params['engineSize']*x+model.params['Intercept']+model.params['transmission[T.Semi-Auto]'], color = 'blue')
sns.lineplot(x = x, y = model.params['engineSize']*x+model.params['Intercept']+model.params['transmission[T.Manual]'], color = 'green')
sns.lineplot(x =["Automatic","Semi-Automatic", "Manual"])
plt.legend(labels'Engine size (in litre)')
plt.xlabel('Predicted car price')
plt.ylabel('${x:,.0f}') ax.yaxis.set_major_formatter(
Based on the developed model, for a given engine size, the car with a semi-automatic transmission is estimated to be the most expensive on average, while the car with a manual transmission is estimated to be the least expensive on average.
Changing the baseline level: By default, the baseline level is chosen as the one that comes first if the levels are arranged in alphabetical order. However, you can change the baseline level by specifying one explicitly.
Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. You may refer to this section in the patsy documentation to specify a particular level of the categorical variable as the baseline.
For example, suppose we wish to change the baseline level to Manual transmission. We can specify this in the formula as follows:
ols_object = smf.ols(formula = 'price~engineSize+C(transmission, Treatment("Manual"))', data = train_updated)
model = ols_object.fit()
model.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.459 |
| Model: | OLS | Adj. R-squared: | 0.458 |
| Method: | Least Squares | F-statistic: | 1400. |
| Date: | Tue, 24 Jan 2023 | Prob (F-statistic): | 0.00 |
| Time: | 15:28:39 | Log-Likelihood: | -53644. |
| No. Observations: | 4959 | AIC: | 1.073e+05 |
| Df Residuals: | 4955 | BIC: | 1.073e+05 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -3727.9400 | 492.917 | -7.563 | 0.000 | -4694.275 | -2761.605 |
| C(transmission, Treatment("Manual"))[T.Automatic] | 6770.6165 | 442.116 | 15.314 | 0.000 | 5903.873 | 7637.360 |
| C(transmission, Treatment("Manual"))[T.Semi-Auto] | 1.176e+04 | 473.110 | 24.867 | 0.000 | 1.08e+04 | 1.27e+04 |
| engineSize | 1.023e+04 | 247.485 | 41.323 | 0.000 | 9741.581 | 1.07e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 1575.518 | Durbin-Watson: | 0.579 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 11006.609 |
| Skew: | 1.334 | Prob(JB): | 0.00 |
| Kurtosis: | 9.793 | Cond. No. | 8.62 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
3.1.3 Including qualitative predictors and their interaction with continuous predictors in the model
Note that the qualitative predictor leads to fitting 3 parallel lines to the data, as there are 3 categories.
However, note that we have made the constant-association assumption. The fact that the lines are parallel means that the average increase in car price for a one-litre increase in engine size does not depend on the type of transmission. This represents a potentially serious limitation of the model, since a change in engine size may in fact have a very different association with the price of an automatic car than with that of a semi-automatic or manual car.

This limitation can be addressed by adding interaction terms, which are the products of engineSize with the dummy variables for semi-automatic and manual transmissions.
#Using the ols function to create an ols object. 'ols' stands for 'Ordinary least squares'
ols_object = smf.ols(formula = 'price~engineSize*transmission', data = train_updated)
model = ols_object.fit()
model.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.479 |
| Model: | OLS | Adj. R-squared: | 0.478 |
| Method: | Least Squares | F-statistic: | 909.9 |
| Date: | Sun, 22 Jan 2023 | Prob (F-statistic): | 0.00 |
| Time: | 22:55:55 | Log-Likelihood: | -53550. |
| No. Observations: | 4959 | AIC: | 1.071e+05 |
| Df Residuals: | 4953 | BIC: | 1.072e+05 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3754.7238 | 895.221 | 4.194 | 0.000 | 1999.695 | 5509.753 |
| transmission[T.Manual] | 1768.5856 | 1294.071 | 1.367 | 0.172 | -768.366 | 4305.538 |
| transmission[T.Semi-Auto] | -5282.7164 | 1416.472 | -3.729 | 0.000 | -8059.628 | -2505.805 |
| engineSize | 9928.6082 | 354.511 | 28.006 | 0.000 | 9233.610 | 1.06e+04 |
| engineSize:transmission[T.Manual] | -5285.9059 | 646.175 | -8.180 | 0.000 | -6552.695 | -4019.117 |
| engineSize:transmission[T.Semi-Auto] | 4162.2428 | 552.597 | 7.532 | 0.000 | 3078.908 | 5245.578 |

| | | | |
|---|---|---|---|
| Omnibus: | 1379.846 | Durbin-Watson: | 0.622 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 9799.471 |
| Skew: | 1.139 | Prob(JB): | 0.00 |
| Kurtosis: | 9.499 | Cond. No. | 30.8 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The model equations for the model with interactions are:

Automatic transmission: price = 3754.7238 + 9928.6082 * engineSize,

Semi-Automatic transmission: price = 3754.7238 + 9928.6082 * engineSize + (-5282.7164 + 4162.2428 * engineSize),

Manual transmission: price = 3754.7238 + 9928.6082 * engineSize + (1768.5856 - 5285.9059 * engineSize),

or

Automatic transmission: price = 3754.7238 + 9928.6082 * engineSize,

Semi-Automatic transmission: price = -1528 + 14091 * engineSize,

Manual transmission: price = 5523 + 4643 * engineSize
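These intercepts and slopes can also be recovered programmatically from the fitted coefficients; a minimal sketch using the model object fitted above:

#Recovering the intercept and slope of each regression line from the fitted coefficients
b = model.params
print('Automatic: price =', round(b['Intercept']), '+', round(b['engineSize']), '* engineSize')
for level in ['Semi-Auto', 'Manual']:
    intercept = b['Intercept'] + b['transmission[T.' + level + ']']
    slope = b['engineSize'] + b['engineSize:transmission[T.' + level + ']']
    print(level, ': price =', round(intercept), '+', round(slope), '* engineSize')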
Q: Interpret the coefficient of manual transmission, i.e., the coefficient of transmission[T.Manual].

A: For a hypothetical scenario of zero engine size, the estimated mean price of a car with Manual transmission is \(\approx\) \$1768 more than the estimated mean price of a car with Automatic transmission.
Q: Interpret the coefficient of the interaction between engine size and manual transmission, i.e., the coefficient of engineSize:transmission[T.Manual].

A: For a unit (or one-litre) increase in engineSize, the increase in estimated mean price of a car with Manual transmission is \(\approx\) \$5286 less than the increase in estimated mean price of a car with Automatic transmission.
#Visualizing the developed model with interaction terms
"figure.figsize"] = (9,6)
plt.rcParams[set(font_scale = 1.3)
sns.= np.linspace(train_updated.engineSize.min(),train_updated.engineSize.max(),100)
x = sns.lineplot(x = x, y = model.params['engineSize']*x+model.params['Intercept'], label='Automatic', color = 'red')
ax 'engineSize']+model.params['engineSize:transmission[T.Semi-Auto]'])*x+model.params['Intercept']+model.params['transmission[T.Semi-Auto]'], '-b', label='Semi-Automatic')
plt.plot(x, (model.params['engineSize']+model.params['engineSize:transmission[T.Manual]'])*x+model.params['Intercept']+model.params['transmission[T.Manual]'], '-g', label='Manual')
plt.plot(x, (model.params[='upper left')
plt.legend(loc'Engine size (in litre)')
plt.xlabel('Predicted car price')
plt.ylabel('${x:,.0f}') ax.yaxis.set_major_formatter(
Note that the interaction terms add flexibility to the model.

The slope of the regression line for semi-automatic cars is the largest. This suggests that an increase in engine size is associated with a larger increase in car price for semi-automatic cars than for other cars.
3.2 Variable transformations
So far we have considered only a linear relationship between the predictors and the response. However, the relationship may be non-linear.
Consider the regression plot of price on mileage.
ax = sns.regplot(x = train_updated.mileage, y = train_updated.price, color = 'orange', line_kws = {'color':'blue'})
plt.xlabel('Mileage')
plt.ylabel('Predicted car price')
ax.yaxis.set_major_formatter('${x:,.0f}')
ax.xaxis.set_major_formatter('{x:,.0f}')
#R-squared of the model with just mileage
model = smf.ols('price~mileage', data = train_updated).fit()
model.rsquared
0.22928048993376182
From the scatterplot, the relationship between price and mileage doesn't seem to be linear, as the points do not lie along a straight line, and the best-fit regression line doesn't fit the points well. However, price on average seems to decrease with mileage, albeit in a non-linear manner.
3.2.1 Quadratic transformation
So, we guess that if we model price as a quadratic function of mileage, the curve may fit the points better. Let us transform the predictor mileage to include \(mileage^2\) (i.e., perform a quadratic transformation on the predictor).
#Including mileage squared as a predictor and developing the model
ols_object = smf.ols(formula = 'price~mileage+I(mileage**2)', data = train_updated)
model = ols_object.fit()
model.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.271 |
| Model: | OLS | Adj. R-squared: | 0.271 |
| Method: | Least Squares | F-statistic: | 920.6 |
| Date: | Sun, 22 Jan 2023 | Prob (F-statistic): | 0.00 |
| Time: | 23:26:05 | Log-Likelihood: | -54382. |
| No. Observations: | 4959 | AIC: | 1.088e+05 |
| Df Residuals: | 4956 | BIC: | 1.088e+05 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.44e+04 | 332.710 | 103.382 | 0.000 | 3.37e+04 | 3.5e+04 |
| mileage | -0.5662 | 0.017 | -33.940 | 0.000 | -0.599 | -0.534 |
| I(mileage ** 2) | 2.629e-06 | 1.56e-07 | 16.813 | 0.000 | 2.32e-06 | 2.94e-06 |

| | | | |
|---|---|---|---|
| Omnibus: | 2362.973 | Durbin-Watson: | 0.325 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 22427.952 |
| Skew: | 2.052 | Prob(JB): | 0.00 |
| Kurtosis: | 12.576 | Cond. No. | 4.81e+09 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.81e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
Note that in the formula specified within the ols() function, the I() operator isolates or insulates the contents within I(...) from the regular formula operators. Without the I() operator, mileage**2 will be treated as the interaction of mileage with itself, which is just mileage. Thus, to add the square of mileage as a separate predictor, we need to use the I() operator.
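A quick sketch to confirm this behaviour: without I(), the fitted model contains only the Intercept and mileage terms.

#Without I(), mileage**2 expands to just mileage in the formula language
smf.ols(formula = 'price~mileage**2', data = train_updated).fit().params.index.tolist()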
Let us visualize the model fit with the quadratic transformation of the predictor mileage.
#Visualizing the regression line with the model consisting of the quadratic transformation of the predictor - mileage
pred_price = model.predict(train_updated)
ax = sns.scatterplot(x = 'mileage', y = 'price', data = train_updated, color = 'orange')
sns.lineplot(x = train_updated.mileage, y = pred_price, color = 'blue')
plt.xlabel('Mileage')
plt.ylabel('Predicted car price')
ax.yaxis.set_major_formatter('${x:,.0f}')
ax.xaxis.set_major_formatter('{x:,.0f}')
The above model seems to fit the data better than the model without the transformation, at least up to a mileage of around 125,000. The \(R^2\) of the model with the quadratic transformation of mileage is also higher than that of the model without the transformation, indicating a better fit.
3.2.2 Cubic transformation
Let us see if a cubic transformation of mileage can further improve the model fit.
#Including mileage squared and mileage cubed as predictors and developing the model
ols_object = smf.ols(formula = 'price~mileage+I(mileage**2)+I(mileage**3)', data = train_updated)
model = ols_object.fit()
model.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.283 |
| Model: | OLS | Adj. R-squared: | 0.283 |
| Method: | Least Squares | F-statistic: | 652.3 |
| Date: | Sun, 22 Jan 2023 | Prob (F-statistic): | 0.00 |
| Time: | 23:33:27 | Log-Likelihood: | -54340. |
| No. Observations: | 4959 | AIC: | 1.087e+05 |
| Df Residuals: | 4955 | BIC: | 1.087e+05 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.598e+04 | 371.926 | 96.727 | 0.000 | 3.52e+04 | 3.67e+04 |
| mileage | -0.7742 | 0.028 | -27.634 | 0.000 | -0.829 | -0.719 |
| I(mileage ** 2) | 6.875e-06 | 4.87e-07 | 14.119 | 0.000 | 5.92e-06 | 7.83e-06 |
| I(mileage ** 3) | -1.823e-11 | 1.98e-12 | -9.199 | 0.000 | -2.21e-11 | -1.43e-11 |

| | | | |
|---|---|---|---|
| Omnibus: | 2380.788 | Durbin-Watson: | 0.321 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 23039.307 |
| Skew: | 2.065 | Prob(JB): | 0.00 |
| Kurtosis: | 12.719 | Cond. No. | 7.73e+14 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.73e+14. This might indicate that there are
strong multicollinearity or other numerical problems.
#Visualizing the model with the cubic transformation of mileage
pred_price = model.predict(train_updated)
ax = sns.scatterplot(x = 'mileage', y = 'price', data = train_updated, color = 'orange')
sns.lineplot(x = train_updated.mileage, y = pred_price, color = 'blue')
plt.xlabel('Mileage')
plt.ylabel('Predicted car price')
ax.yaxis.set_major_formatter('${x:,.0f}')
ax.xaxis.set_major_formatter('{x:,.0f}')
Note that the model fit with the cubic transformation of mileage seems slightly better than the models with the quadratic transformation and with no transformation, for mileage up to about 180k. However, the model should not be used to predict the prices of cars with a mileage higher than 180k.
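To see why, we can evaluate the cubic model at hypothetical mileages beyond the observed range; a sketch (with the coefficients above, the predicted price eventually turns negative):

#Illustration: predictions from the cubic model at mileages beyond the observed range
model.predict(pd.DataFrame({'mileage':[200000, 250000]}))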
Let’s update the model created earlier (in the beginning of this chapter) to include the transformed predictor.
#Model with an interaction term and a variable transformation term
ols_object = smf.ols(formula = 'price~year*engineSize+mileage+mpg+I(mileage**2)', data = train_updated)
model = ols_object.fit()
model.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.702 |
| Model: | OLS | Adj. R-squared: | 0.702 |
| Method: | Least Squares | F-statistic: | 1947. |
| Date: | Sun, 22 Jan 2023 | Prob (F-statistic): | 0.00 |
| Time: | 23:42:13 | Log-Likelihood: | -52162. |
| No. Observations: | 4959 | AIC: | 1.043e+05 |
| Df Residuals: | 4952 | BIC: | 1.044e+05 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1.53e+06 | 2.7e+05 | 5.671 | 0.000 | 1e+06 | 2.06e+06 |
| year | -755.7419 | 133.791 | -5.649 | 0.000 | -1018.031 | -493.453 |
| engineSize | -2.022e+06 | 9.72e+04 | -20.803 | 0.000 | -2.21e+06 | -1.83e+06 |
| year:engineSize | 1008.6993 | 48.196 | 20.929 | 0.000 | 914.215 | 1103.184 |
| mileage | -0.3548 | 0.014 | -25.973 | 0.000 | -0.382 | -0.328 |
| mpg | -54.7450 | 8.896 | -6.154 | 0.000 | -72.185 | -37.305 |
| I(mileage ** 2) | 1.926e-06 | 1.04e-07 | 18.536 | 0.000 | 1.72e-06 | 2.13e-06 |

| | | | |
|---|---|---|---|
| Omnibus: | 2355.448 | Durbin-Watson: | 0.562 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 38317.404 |
| Skew: | 1.857 | Prob(JB): | 0.00 |
| Kurtosis: | 16.101 | Cond. No. | 6.40e+12 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.4e+12. This might indicate that there are
strong multicollinearity or other numerical problems.
Note that the R-squared has increased as compared to the model with just the interaction term.
#Computing RMSE on test data
pred_price = model.predict(testf)
np.sqrt(((testp.price - pred_price)**2).mean())
9074.494088619422
Note that the prediction accuracy of the model has further increased, as the RMSE has reduced. The transformed predictor is statistically significant and provides additional flexibility to better capture the trend in the data, leading to an increase in prediction accuracy.
3.3 PolynomialFeatures()
The function PolynomialFeatures() from the sklearn library can be used to generate a predictor matrix that includes all interactions and transformations up to a degree d.
X_train = train[['mileage', 'engineSize', 'year', 'mpg']]
y_train = train[['price']]
X_test = test[['mileage', 'engineSize', 'year', 'mpg']]
y_test = test[['price']]
3.3.1 Generating polynomial features
Let us generate polynomial features up to degree 2. This will include all two-factor interactions and all squared terms.
poly = PolynomialFeatures(2, include_bias = False) # Create the object - degree is 2

# Generate the polynomial features
X_train_poly = poly.fit_transform(X_train)
Note that the LinearRegression() function adds the intercept by default (check the fit_intercept argument). Thus, we have put include_bias = False while generating the polynomial features, as we don't need the intercept column. The term bias here refers to the intercept (you will learn about bias in detail in STAT303-3). Another option is to include the intercept while generating the polynomial features and put fit_intercept = False in the LinearRegression() function.
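A sketch of that alternative, assuming the same degree-2 features:

#Equivalent alternative: keep the bias column in the features and drop the intercept in LinearRegression
poly_with_bias = PolynomialFeatures(2, include_bias = True)
X_train_poly_bias = poly_with_bias.fit_transform(X_train)
model_bias = LinearRegression(fit_intercept = False)
model_bias.fit(X_train_poly_bias, y_train)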
Below are the polynomial features generated by PolynomialFeatures().
poly.get_feature_names_out()
array(['mileage', 'engineSize', 'year', 'mpg', 'mileage^2',
'mileage engineSize', 'mileage year', 'mileage mpg',
'engineSize^2', 'engineSize year', 'engineSize mpg', 'year^2',
'year mpg', 'mpg^2'], dtype=object)
3.3.2 Fitting the model
model = LinearRegression()
model.fit(X_train_poly, y_train)
LinearRegression()
3.3.3 Testing the model
X_test_poly = poly.fit_transform(X_test)
#RMSE
np.sqrt(mean_squared_error(y_test, model.predict(X_test_poly)))
8896.175508213777
Note that the polynomial features have helped reduce the RMSE further.
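Keeping the earlier caution about overfitting in mind, here is a sketch of how one might compare test RMSEs across degrees before settling on one (the range of degrees tried is arbitrary):

#Comparing the test RMSE for polynomial features of different degrees
for d in range(1, 4):
    poly_d = PolynomialFeatures(d, include_bias = False)
    model_d = LinearRegression().fit(poly_d.fit_transform(X_train), y_train)
    print(d, np.sqrt(mean_squared_error(y_test, model_d.predict(poly_d.transform(X_test)))))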