#loading libraries
library(ggplot2)
7 Variable interactions & Qualitative predictors
In this chapter, we’ll introduce interactions, and then develop and interpret models with interactions among quantitative predictors, and interactions between quantitative and qualitative predictors.
7.1 Interaction model
Let us consider the file car_data.csv.
# Reading data
car_data <- read.csv('./Datasets/car_data.csv')
head(car_data)
carID brand model year transmission mileage fuelType tax mpg
1 12002 hyundi Santa Fe 2017 Semi-Auto 32467 Diesel 235 42.9709
2 12003 vw Arteon 2019 Automatic 1555 Petrol 145 40.5071
3 12005 toyota Verso 2003 Automatic 104000 Petrol 300 34.5227
4 12006 ford Grand C-MAX 2018 Manual 5113 Petrol 145 47.6225
5 12007 bmw X6 2019 Automatic 9010 Diesel 145 35.2224
6 12008 toyota Prius 2016 Automatic 32853 Hybrid 10 63.8371
engineSize price
1 2.2 18991
2 1.5 22500
3 1.8 2395
4 1.0 14000
5 3.0 58700
6 1.8 22995
In an additive model, we assume that the association between a predictor \(X_j\) and response \(Y\) does not depend on the value of other predictors. For example, consider the multiple linear regression model below.
# Additive model
additive_model <- lm(price~year+engineSize+mileage+mpg, data = car_data)
summary(additive_model)
Call:
lm(formula = price ~ year + engineSize + mileage + mpg, data = car_data)
Residuals:
Min 1Q Median 3Q Max
-35346 -5131 -1605 2854 87509
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.661e+06 1.489e+05 -24.593 <2e-16 ***
year 1.818e+03 7.375e+01 24.647 <2e-16 ***
engineSize 1.218e+04 1.900e+02 64.107 <2e-16 ***
mileage -1.474e-01 8.768e-03 -16.817 <2e-16 ***
mpg -7.931e+01 9.338e+00 -8.493 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9564 on 4955 degrees of freedom
Multiple R-squared: 0.6605, Adjusted R-squared: 0.6602
F-statistic: 2410 on 4 and 4955 DF, p-value: < 2.2e-16
The above model assumes that the average increase in price associated with a unit increase in engineSize is always $12,180, regardless of the values of the other predictors. However, this assumption may be incorrect.
7.2 Interaction between continuous predictors
We can relax this assumption by considering another predictor, called an interaction term. Let us assume that the average increase in price associated with a one-unit increase in engineSize depends on the model year of the car. In other words, there is an interaction between engineSize and year. This interaction can be included as a predictor, which is the product of engineSize and year. Note that there are several possible interactions that we can consider. Here the interaction between engineSize and year is just an example.
# Interaction model
interaction_model <- lm(price~year*engineSize+mileage+mpg, data = car_data)
summary(interaction_model)
Call:
lm(formula = price ~ year * engineSize + mileage + mpg, data = car_data)
Residuals:
Min 1Q Median 3Q Max
-40479 -4929 -1548 2864 85271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.606e+05 2.737e+05 2.048 0.0406 *
year -2.754e+02 1.357e+02 -2.029 0.0425 *
engineSize -1.796e+06 9.968e+04 -18.019 <2e-16 ***
mileage -1.525e-01 8.496e-03 -17.954 <2e-16 ***
mpg -8.434e+01 9.048e+00 -9.322 <2e-16 ***
year:engineSize 8.968e+02 4.943e+01 18.142 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9262 on 4954 degrees of freedom
Multiple R-squared: 0.6816, Adjusted R-squared: 0.6813
F-statistic: 2121 on 5 and 4954 DF, p-value: < 2.2e-16
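Note that \(R^2\) has increased from 0.6605 in the additive model to 0.6816, which is expected since \(R^2\) cannot decrease when a predictor is added. The adjusted \(R^2\) has also increased, from 0.6602 to 0.6813.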
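The formula `year*engineSize` used in the interaction model is shorthand for the two main effects plus their product. The sketch below only inspects the formula (no data is needed) to list the terms R will fit; the object names `f` and `term_labels` are illustrative and not part of the chapter's code.

```r
# Formula of the interaction model, exactly as passed to lm() above
f <- price ~ year * engineSize + mileage + mpg

# terms() expands the '*' operator: main effects first, then the interaction
term_labels <- attr(terms(f), "term.labels")
term_labels
```

The listed terms match the coefficient rows of the summary: `year`, `engineSize`, `mileage`, `mpg`, and `year:engineSize`.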
The model equation is:
\(\text{price} = \beta_0 + \beta_1\,\text{year} + \beta_2\,\text{engineSize} + \beta_3\,(\text{year} \times \text{engineSize}) + \beta_4\,\text{mileage} + \beta_5\,\text{mpg}\), or
\(\text{price} = \beta_0 + \beta_1\,\text{year} + (\beta_2 + \beta_3\,\text{year}) \times \text{engineSize} + \beta_4\,\text{mileage} + \beta_5\,\text{mpg}\), or
\(\text{price} = \beta_0 + \beta_1\,\text{year} + \tilde \beta\,\text{engineSize} + \beta_4\,\text{mileage} + \beta_5\,\text{mpg}\),
where \(\tilde \beta = \beta_2 + \beta_3\,\text{year}\).
Since \(\tilde \beta\) is a function of year, the association between engineSize and price is no longer a constant. A change in the value of year will change the association between price and engineSize.
Substituting the values of the coefficients:
\(\text{price} = 5.606 \times 10^{5} - 275.3833\,\text{year} + (-1.796 \times 10^{6} + 896.7687\,\text{year})\,\text{engineSize} - 0.1525\,\text{mileage} - 84.3417\,\text{mpg}\)
Thus, for cars of model year 2010, the average increase in price for a one-litre increase in engine size is -1.796e6 + 896.7687 * 2010 \(\approx\) $6,500, assuming all the other predictors are held constant. However, for cars of model year 2020, the average increase in price for a one-litre increase in engine size is -1.796e6 + 896.7687 * 2020 \(\approx\) $15,500, assuming all the other predictors are held constant.
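The arithmetic above can be reproduced with a small helper. This is a minimal sketch: `engine_slope` is a hypothetical name, and the two coefficients are typed in from the printed summary (so the main effect of engineSize is the rounded value -1.796e6) rather than read from the fitted model object.

```r
# Coefficients copied from the printed summary(interaction_model) output
b_engineSize      <- -1.796e6   # main effect of engineSize (rounded as printed)
b_year_engineSize <- 896.7687   # coefficient of year:engineSize

# Estimated change in mean price per one-unit increase in engineSize,
# as a function of the model year (other predictors held constant)
engine_slope <- function(year) b_engineSize + b_year_engineSize * year

engine_slope(c(2010, 2020))
```

With the fitted model available, the same quantities could be taken from `coef(interaction_model)` instead of being typed in, which avoids the rounding in the printed main effect.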
Similarly, the equation can be re-arranged as:
\(\text{price} = 5.606 \times 10^{5} + (-275.3833 + 896.7687\,\text{engineSize})\,\text{year} - 1.796 \times 10^{6}\,\text{engineSize} - 0.1525\,\text{mileage} - 84.3417\,\text{mpg}\)
Thus, for cars with an engine size of 2 litres, the average increase in price for a one year newer model is -275.3833+896.7687 * 2 \(\approx\) $1500, assuming all the other predictors are constant. However, for cars with an engine size of 3 litres, the average increase in price for a one year newer model is -275.3833+896.7687 * 3 \(\approx\) $2400, assuming all the other predictors are constant.
7.3 Qualitative predictors
Let us develop a model for predicting price based on engineSize and the qualitative predictor transmission.
#checking the distribution of values of transmission
table(car_data$transmission)
Automatic Manual Other Semi-Auto
1660 1948 1 1351
Note that the Other category of the variable transmission contains only a single observation, which is likely to be insufficient to train the model. We’ll remove that observation from the car data. Another option may be to combine the observation in the Other category with the nearest category, and keep it in the data.
car_data <- car_data[car_data$transmission!='Other',]
qual_pred_model <- lm(price~engineSize+transmission, data = car_data)
summary(qual_pred_model)
Call:
lm(formula = price ~ engineSize + transmission, data = car_data)
Residuals:
Min 1Q Median 3Q Max
-47181 -6726 -1145 5204 95998
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3042.7 661.2 4.602 4.29e-06 ***
engineSize 10226.8 247.5 41.323 < 2e-16 ***
transmissionManual -6770.6 442.1 -15.314 < 2e-16 ***
transmissionSemi-Auto 4994.3 443.0 11.274 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12080 on 4955 degrees of freedom
Multiple R-squared: 0.4587, Adjusted R-squared: 0.4584
F-statistic: 1400 on 3 and 4955 DF, p-value: < 2.2e-16
Note that there is no coefficient for the Automatic level of the variable transmission. If a car doesn’t have Manual or Semi-Auto transmission, then it has an Automatic transmission. Thus, the coefficient of Automatic will be redundant, and the dummy variable corresponding to Automatic transmission is dropped from the model.
The level of the categorical variable that is dropped from the model is called the baseline level. Here Automatic transmission is the baseline level. The coefficients of other levels of transmission should be interpreted with respect to the baseline level.
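The dummy coding and the choice of baseline can be seen directly in the design matrix. The sketch below uses a toy factor, not the car data; the names `transmission`, `X`, and `transmission_m` are illustrative. It shows that R's default coding drops the first level alphabetically, and that `relevel()` selects a different baseline.

```r
# Toy factor with the three transmission levels kept in the data
transmission <- factor(c("Automatic", "Manual", "Semi-Auto", "Manual"))

# Default treatment coding: the first level (Automatic) is the baseline,
# so it gets no dummy column
X <- model.matrix(~ transmission)
colnames(X)

# Making Manual the baseline instead
transmission_m <- relevel(transmission, ref = "Manual")
colnames(model.matrix(~ transmission_m))
```

Changing the baseline does not change the fitted values of a model, only the level against which the other coefficients are compared.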
Q: Interpret the intercept term.
Ans: For the hypothetical scenario of a car with zero engine size and Automatic transmission, the estimated mean car price is \(\approx\) $3,043.
Q: Interpret the coefficient of transmissionManual.
Ans: For a given engine size, the estimated mean price of a car with Manual transmission is \(\approx\) $6,771 less than that of a car with Automatic transmission.
Let us visualize the developed model.
# Colors for the three fitted lines
colors <- c("Automatic" = "red", "Manual" = "blue", "Semi-Automatic" = "green")

coefs <- qual_pred_model$coefficients
x <- car_data$engineSize
ggplot(data = car_data, aes(x = engineSize))+
  geom_line(aes(y = coefs['(Intercept)']+x*coefs['engineSize'], color = 'Automatic'))+
  geom_line(aes(y = coefs['(Intercept)']+x*coefs['engineSize']+coefs['transmissionManual'], color = 'Manual'))+
  geom_line(aes(y = coefs['(Intercept)']+x*coefs['engineSize']+coefs['transmissionSemi-Auto'], color = 'Semi-Automatic'))+
  # Apply the colors defined above to the legend labels
  scale_color_manual(values = colors)+
  theme(legend.title = element_blank(),
        legend.position = c(0.15,0.85))+
  labs(
    y = 'Predicted car price',
    x = 'Engine size (in litre)'
  )
Based on the developed model, for a given engine size, the car with a semi-automatic transmission is estimated to be the most expensive on average, while the car with a manual transmission is estimated to be the least expensive on average.
Q: For a given engine size, what is the expected difference between the price of a car with Manual transmission and a car with Semi-Automatic transmission?
A: The expected difference is \(\hat{\beta}_{I(\text{Transmission} = \text{Manual})} - \hat{\beta}_{I(\text{Transmission} = \text{Semi-Automatic})} = -6770.6 - 4994.3 = -\$11,764.9\)
Q: Find the 95% confidence interval for the point estimate obtained in the previous question.
A: Let us compute the standard error of the point estimate:
\(Var(\hat{\beta}_{I(\text{Transmission} = \text{Manual})} - \hat{\beta}_{I(\text{Transmission} = \text{Semi-Automatic})}) = Var(\hat{\beta}_{I(\text{Transmission} = \text{Manual})}) + Var(\hat{\beta}_{I(\text{Transmission} = \text{Semi-Automatic})}) - 2\,Cov(\hat{\beta}_{I(\text{Transmission} = \text{Manual})}, \hat{\beta}_{I(\text{Transmission} = \text{Semi-Automatic})})\)
The R function vcov() provides the variance-covariance matrix of the regression coefficients, which can be used to evaluate the above expression.
# Variance-covariance matrix of the regression coefficients
vcov_matrix <- vcov(qual_pred_model)
We’ll use the variance-covariance matrix to compute the variance of the point estimate, and use the variance to compute the confidence interval:
# Variance of the point estimate
# (rows/columns 3 and 4 correspond to transmissionManual and transmissionSemi-Auto)
variance_point_estimate <- vcov_matrix[3,3] + vcov_matrix[4,4] - 2*vcov_matrix[3,4]

# Standard error of the point estimate
std_point_estimate <- sqrt(variance_point_estimate)

# Upper bound of the 95% CI of the point estimate
UB <- -11764.9 + std_point_estimate*qt(0.975, 4959-4)
print(paste0("Upper bound = ",UB))
[1] "Upper bound = -10837.3953392871"
# Lower bound of the 95% CI of the point estimate
LB <- -11764.9 - std_point_estimate*qt(0.975, 4959-4)
print(paste0("Lower bound = ", LB))
[1] "Lower bound = -12692.4046607129"
The 95% confidence interval is: [-$12.7k, -$10.8k].
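An alternative to the variance-covariance calculation is to change the baseline level, so that the difference of interest becomes a single coefficient and `confint()` returns its interval directly. The sketch below checks that the two routes agree. It uses R's built-in `iris` data as a stand-in (with `Species` playing the role of transmission), since the car data is not bundled with R; all object names are illustrative.

```r
# Stand-in model: one continuous predictor and one 3-level factor
fit <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)

# Route 1: difference of two dummy coefficients, using vcov()
V   <- vcov(fit)
est <- coef(fit)[["Speciesversicolor"]] - coef(fit)[["Speciesvirginica"]]
se  <- sqrt(V["Speciesversicolor", "Speciesversicolor"] +
            V["Speciesvirginica", "Speciesvirginica"] -
            2 * V["Speciesversicolor", "Speciesvirginica"])
ci_manual <- est + c(-1, 1) * qt(0.975, df.residual(fit)) * se

# Route 2: make virginica the baseline, so that the versicolor coefficient
# is the versicolor-minus-virginica difference
iris2 <- transform(iris, Species = relevel(Species, ref = "virginica"))
fit2  <- lm(Sepal.Length ~ Petal.Length + Species, data = iris2)
ci_relevel <- confint(fit2)["Speciesversicolor", ]

# The two intervals agree up to numerical tolerance
all.equal(unname(ci_manual), unname(ci_relevel))
```

For the car data, the analogous step would be to relevel transmission with Semi-Auto as the reference and read the interval for the Manual coefficient.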
7.4 Interaction between qualitative and continuous predictors
Note that the qualitative predictor leads to fitting 3 parallel lines to the data, as there are 3 categories.
However, note that we have made the constant association assumption. The fact that the lines are parallel means that the average increase in car price for a one-litre increase in engine size does not depend on the type of transmission. This represents a potentially serious limitation of the model, since a change in engine size may in fact have a very different association with the price of an automatic car than with the price of a semi-automatic or manual car.
This limitation can be addressed by adding interaction variables, which are the products of engineSize and the dummy variables for semi-automatic and manual transmissions.
qual_pred_int_model <- lm(price~engineSize*transmission, data = car_data)
summary(qual_pred_int_model)
Call:
lm(formula = price ~ engineSize * transmission, data = car_data)
Residuals:
Min 1Q Median 3Q Max
-56431 -6453 -1033 5184 96479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3754.7 895.2 4.194 2.79e-05 ***
engineSize 9928.6 354.5 28.006 < 2e-16 ***
transmissionManual 1768.6 1294.1 1.367 0.171786
transmissionSemi-Auto -5282.7 1416.5 -3.729 0.000194 ***
engineSize:transmissionManual -5285.9 646.2 -8.180 3.57e-16 ***
engineSize:transmissionSemi-Auto 4162.2 552.6 7.532 5.90e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11850 on 4953 degrees of freedom
Multiple R-squared: 0.4788, Adjusted R-squared: 0.4782
F-statistic: 909.9 on 5 and 4953 DF, p-value: < 2.2e-16
The model equation for the model with interactions is:
Automatic transmission: \(\text{price} = 3754.7238 + 9928.6082\,\text{engineSize}\),
Semi-Automatic transmission: \(\text{price} = 3754.7238 + 9928.6082\,\text{engineSize} + (-5282.7164 + 4162.2428\,\text{engineSize})\),
Manual transmission: \(\text{price} = 3754.7238 + 9928.6082\,\text{engineSize} + (1768.5856 - 5285.9059\,\text{engineSize})\), or
Automatic transmission: \(\text{price} = 3754.7238 + 9928.6082\,\text{engineSize}\),
Semi-Automatic transmission: \(\text{price} = -1528 + 14091\,\text{engineSize}\),
Manual transmission: \(\text{price} = 5523 + 4643\,\text{engineSize}\).
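The per-transmission intercepts and slopes follow from adding each level's dummy coefficient and interaction coefficient to the baseline values. The sketch below does this arithmetic with the coefficients typed in from the summary output above; `coefs` here is a hand-made named vector and `lines_by_transmission` is an illustrative name, neither taken from the fitted object.

```r
# Coefficients copied from the printed summary(qual_pred_int_model) output
coefs <- c("(Intercept)"                      = 3754.7238,
           "engineSize"                       = 9928.6082,
           "transmissionManual"               = 1768.5856,
           "transmissionSemi-Auto"            = -5282.7164,
           "engineSize:transmissionManual"    = -5285.9059,
           "engineSize:transmissionSemi-Auto" = 4162.2428)

# Baseline (Automatic) values plus the level-specific adjustments
lines_by_transmission <- data.frame(
  transmission = c("Automatic", "Manual", "Semi-Auto"),
  intercept = coefs[["(Intercept)"]] +
    c(0, coefs[["transmissionManual"]], coefs[["transmissionSemi-Auto"]]),
  slope = coefs[["engineSize"]] +
    c(0, coefs[["engineSize:transmissionManual"]],
      coefs[["engineSize:transmissionSemi-Auto"]])
)
lines_by_transmission
```

The slope column makes the comparison across transmissions explicit: the semi-automatic line is the steepest and the manual line the flattest.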
Q: Interpret the coefficient of manual transmission, i.e., the coefficient of transmissionManual.
A: For the hypothetical scenario of a car with zero engine size, the estimated mean price of a car with Manual transmission is \(\approx\) $1,769 more than the estimated mean price of a car with Automatic transmission.
Q: Interpret the coefficient of the interaction between engine size and manual transmission, i.e., the coefficient of engineSize:transmissionManual.
A: For a unit (or a litre) increase in engineSize, the increase in estimated mean price of a car with Manual transmission is \(\approx\) $5,286 less than the increase in estimated mean price of a car with Automatic transmission.
# Colors for the three fitted lines
colors <- c("Automatic" = "red", "Manual" = "blue", "Semi-Automatic" = "green")

coefs <- qual_pred_int_model$coefficients
x <- car_data$engineSize
ggplot(data = car_data, aes(x = engineSize))+
  geom_line(aes(y = coefs['(Intercept)']+x*coefs['engineSize'], color = 'Automatic'))+
  geom_line(aes(y = coefs['(Intercept)']+x*coefs['engineSize']+coefs['transmissionManual']+x*coefs['engineSize:transmissionManual'], color = 'Manual'))+
  geom_line(aes(y = coefs['(Intercept)']+x*coefs['engineSize']+coefs['transmissionSemi-Auto']+x*coefs['engineSize:transmissionSemi-Auto'], color = 'Semi-Automatic'))+
  # Apply the colors defined above to the legend labels
  scale_color_manual(values = colors)+
  theme(legend.title = element_blank(),
        legend.position = c(0.15,0.85))+
  labs(
    y = 'Predicted car price',
    x = 'Engine size (in litre)'
  )
Note the interaction term adds flexibility to the model.
The slope of the regression line for semi-automatic cars is the largest. This suggests that increase in engine size is associated with a higher increase in car price for semi-automatic cars, as compared to other cars.