4 Diagnostics and Remedial Measures (SLR)
Let us consider the Auto dataset from the ISLR library. The library accompanies the book Introduction to Statistical Learning.
4.1 Loading libraries
library(ISLR)    #for the dataset
library(ggplot2) #for making plots
library(lmtest)  #for bptest()
library(MASS)    #for boxcox()
4.2 Reading data
auto_data <- Auto
4.3 Developing model
Suppose we wish to predict mpg based on horsepower.
linear_model <- lm(mpg~horsepower, data = auto_data)
4.4 Model fit
Let us visualize the model fit to the data.
ggplot(data = auto_data, aes(x = horsepower, y = mpg))+
geom_point()+
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
The plot indicates that a quadratic relationship between mpg and horsepower may be more appropriate.
4.5 Diagnostic plots & statistical tests
Let us make diagnostic plots that help us check the model assumptions.
par(mfrow = c(2,2))
plot(linear_model)
4.5.1 Linear relationship
4.5.1.1 Visual check
We can check the linearity assumption visually by analyzing the plot of residuals against fitted values. The plot indicates that a quadratic relationship between the response and the predictor may be more appropriate than a linear one.
4.5.1.2 Statistical test
We’ll use the \(F\) test for lack of fit to check the linearity assumption.
#Full model
full_model <- lm(mpg~as.factor(horsepower), data = auto_data)
# F test for lack of fit
anova(linear_model, full_model)
Analysis of Variance Table
Model 1: mpg ~ horsepower
Model 2: mpg ~ as.factor(horsepower)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 390 9385.9
2 299 4702.8 91 4683.1 3.272 1.125e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected based on the visual check, the linear model fails the linearity test with a very low \(p\)-value of the order of \(10^{-14}\). Thus, we conclude that the relationship is non-linear.
4.5.2 Constant variance (Homoscedasticity)
4.5.2.1 Visual check
The plot of the residuals against fitted values indicates that the error variance is increasing with increasing values of predicted mpg, or decreasing horsepower. Thus, there seems to be heteroscedasticity.
4.5.2.2 Statistical test
Breusch-Pagan test
Let us conduct the Breusch-Pagan test for homoscedasticity. This is a large sample test, which assumes that the error terms are independent and normally distributed, and that the variance of the error term \(\epsilon_i\), denoted by \(\sigma_i^2\), is related to the level of the predictor \(X\) in the following way:
\(\log(\sigma_i^2) = \gamma_0 + \gamma_1X_i\).
This implies that \(\sigma_i^2\) either increases or decreases with the level of \(X\). Constant error variance corresponds to \(\gamma_1 = 0\). The hypotheses for the test are:
\(H_0: \gamma_1 = 0\) \(H_a: \gamma_1 \ne 0\)
bptest(linear_model)
studentized Breusch-Pagan test
data: linear_model
BP = 8.7535, df = 1, p-value = 0.00309
As expected based on the visual check, the model fails the homoscedasticity test with a low \(p\)-value of 0.3%.
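For intuition, the studentized version of the statistic that bptest() reports by default (the Koenker variant) can also be computed by hand: regress the squared residuals on the predictor and use \(nR^2\) from that auxiliary regression, which is asymptotically chi-squared with 1 degree of freedom here. A minimal sketch, with illustrative object names (sq_res, aux_model, bp_stat):
#By-hand version of the studentized Breusch-Pagan statistic
sq_res <- residuals(linear_model)^2
aux_model <- lm(sq_res ~ horsepower, data = auto_data)
bp_stat <- nrow(auto_data) * summary(aux_model)$r.squared
c(BP = bp_stat, p_value = pchisq(bp_stat, df = 1, lower.tail = FALSE))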
Brown-Forsythe test
To conduct the Brown-Forsythe test, we divide the data into two groups, based on the predictor \(X\). If the error variance is not constant, the residuals in one group will tend to be more variable than those in the other group. The test consists of a two-sample \(t\)-test to determine whether the mean of the absolute deviations from the median residual of one group differs significantly from the mean of the absolute deviations from the median residual of the other group.
#Break the residuals into 2 groups
residuals_group1 <- linear_model$residuals[auto_data$horsepower<=100]
residuals_group2 <- linear_model$residuals[auto_data$horsepower>100]
#Obtain the median of each group
median_group1 <- median(residuals_group1)
median_group2 <- median(residuals_group2)
#Two-sample t-test
t.test(abs(residuals_group1-median_group1), abs(residuals_group2-median_group2), var.equal = TRUE)
Two Sample t-test
data: abs(residuals_group1 - median_group1) and abs(residuals_group2 - median_group2)
t = 4.4039, df = 390, p-value = 1.375e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.7738625 2.0220633
sample estimates:
mean of x mean of y
4.308698 2.910735
The Brown-Forsythe test also shows that there is heteroscedasticity.
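As an aside, the same check can be run with leveneTest() from the car package (assuming it is installed), which performs the Brown-Forsythe version of Levene's test when centered at the median; with two groups, its \(F\) statistic is the square of the \(t\) statistic above. A sketch, with hp_group as an illustrative name:
#Sketch assuming the 'car' package is installed
library(car)
hp_group <- factor(auto_data$horsepower <= 100)
leveneTest(linear_model$residuals, group = hp_group, center = median)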
4.5.3 Normality of error terms
4.5.3.1 Visual check
The QQ plot indicates that the distribution of the residuals is slightly right-skewed. This right-skew can be visualized by making a histogram or a density plot of residuals:
par(mfrow = c(1,1))
hist(linear_model$residuals, breaks = 20, prob = TRUE)
lines(density(linear_model$residuals), col = 'blue')
4.5.3.2 Statistical test
shapiro.test(linear_model$residuals)
Shapiro-Wilk normality test
data: linear_model$residuals
W = 0.98207, p-value = 8.734e-05
As expected based on the visual check, the model fails the test for normal distribution of error terms with a low \(p\)-value of the order of \(10^{-5}\).
4.5.4 Independence of error terms
This test is relevant if there is an order in the data, i.e., if the observations can be arranged based on time, space, etc. In the current example, there is no order, and so this test is not relevant.
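For completeness, if the observations did have a natural ordering (e.g., over time), a common check for autocorrelated errors is the Durbin-Watson test, available via dwtest() in the lmtest package loaded above. The call below is purely illustrative, since the rows of the Auto data have no meaningful order:
#Illustrative only: meaningful when rows have a natural (e.g., time) ordering
dwtest(linear_model)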
4.6 Remedial Measures
If the linear relationship assumption is violated, it results in biased estimates of the regression coefficients, and therefore biased predictions. As the estimates are biased, their variance estimates may also be inaccurate. Thus, this is a very important assumption, as its violation leads to inaccuracies in both point estimates and statistical inference.
If the homoscedasticity assumption is violated, but the linear relationship assumption holds, the OLS estimates are still unbiased, resulting in unbiased predictions. However, their standard error estimates are likely to be inaccurate.
If the assumption regarding the normal distribution of errors is violated, the OLS estimates are still BLUE (best linear unbiased estimates). The confidence interval for a point estimate, though affected, is robust to departures from normality for large sample sizes. However, the prediction interval is sensitive to such departures.
Our model above (linear_model) violates all three assumptions.
Note that there is no unique way of fixing all the model assumption violations. It is an iterative process, and there is some trial and error involved. Two different approaches may lead to two different models, and both models may be more or less similar (in terms of point estimates / inference).
4.6.1 Addressing linear relationship assumption violation
Let us address the linear relationship assumption first.
As indicated in the diagnostic plots, the relationship between \(X\) and \(Y\) seemed to be quadratic. We’ll change the simple linear regression model to the following linear regression model, where \(Y\) and \(X\) have a quadratic relationship:
\(Y_i = \beta_0 + \beta_1X_i + \beta_2X_i^2 + \epsilon_i\)
Let us fit the above model to the data:
quadratic_model <- lm(mpg~horsepower + I(horsepower**2), data = auto_data)
Note that in the formula specified within the lm() function, the I() operator isolates or insulates the contents within I(…) from the regular formula operators. Without the I() operator, horsepower**2 will be treated as the interaction of horsepower with itself, which is just horsepower. Thus, to add the square of horsepower as a separate predictor, we need to use the I() operator.
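To see this concretely, the sketch below compares the coefficients with and without I(). Note that in an R formula, ** is parsed the same way as the formula operator ^, so horsepower**2 on its own collapses to horsepower; poly() with raw = TRUE is an equivalent alternative:
#Without I(), the square term is absorbed into the main effect
coef(lm(mpg~horsepower**2, data = auto_data))   #same fit as mpg~horsepower
#With I(), the squared term enters as a separate predictor
coef(lm(mpg~horsepower + I(horsepower**2), data = auto_data))
#Equivalent alternative using raw polynomial terms
coef(lm(mpg~poly(horsepower, 2, raw = TRUE), data = auto_data))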
After every transformation or remedial measure, we should diagnose the model to check whether the model assumptions are satisfied.
par(mfrow=c(2,2))
plot(quadratic_model)
We observe that the departure from the linear relationship assumption has been reduced. However, there still seems to be heteroscedasticity and non-normal error terms. Let us verify this with statistical tests.
#F test for lack of fit
anova(quadratic_model, full_model)
Analysis of Variance Table
Model 1: mpg ~ horsepower + I(horsepower^2)
Model 2: mpg ~ as.factor(horsepower)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 389 7442.0
2 299 4702.8 90 2739.3 1.9351 1.897e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Even though the model still fails the linearity test, the extent of violation is reduced as the \(p\)-value is now of the order of \(10^{-5}\) instead of \(10^{-14}\).
# Breusch-Pagan test
bptest(quadratic_model)
studentized Breusch-Pagan test
data: quadratic_model
BP = 34.528, df = 2, p-value = 3.179e-08
Based on the \(p\)-value, the quadratic model has more severe heteroscedasticity than the linear model.
shapiro.test(quadratic_model$residuals)
Shapiro-Wilk normality test
data: quadratic_model$residuals
W = 0.98268, p-value = 0.0001214
The severity of departure from normality has been reduced in the quadratic model.
4.6.2 Addressing heteroscedasticity
Let us address the heteroscedasticity assumption.
We can see from the diagnostic plot of the quadratic model that the error variance is higher for higher fitted values. A possible way to resolve this is to transform the response \(Y\) using a concave function such as log(\(Y\)) or \(\sqrt{Y}\). Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity. Let us try the log(\(Y\)) transformation to address heteroscedasticity.
log_model <- lm(log(mpg)~horsepower+I(horsepower**2), data = auto_data)
par(mfrow=c(2,2))
plot(log_model)
Based on the plot of residuals against fitted values, heteroscedasticity seems to be mitigated to some extent. Let us confirm with the statistical test.
bptest(log_model)
studentized Breusch-Pagan test
data: log_model
BP = 5.2023, df = 2, p-value = 0.07419
Assuming a significance level of \(5\%\), we do not reject the null hypothesis that the error variance is constant. Thus, the homoscedasticity assumption holds for the log-transformed model.
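The square-root transformation mentioned earlier is a milder concave alternative; a quick sketch to compare it on the same criterion (output not shown, and sqrt_model is an illustrative name):
#Milder concave transformation of the response; compare its Breusch-Pagan p-value
sqrt_model <- lm(sqrt(mpg)~horsepower+I(horsepower**2), data = auto_data)
bptest(sqrt_model)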
The linear relationship assumption seems to be satisfied. Let us check it via the \(F\) lack of fit test.
log_full_model <- lm(log(mpg)~as.factor(horsepower), data = auto_data)
anova(log_model, log_full_model)
Analysis of Variance Table
Model 1: log(mpg) ~ horsepower + I(horsepower^2)
Model 2: log(mpg) ~ as.factor(horsepower)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 389 12.100
2 299 7.586 90 4.5138 1.9768 1.032e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Although the \(p\)-value is higher than for both the previous models, indicating that the departure from the linear relationship has been reduced, the model still fails the \(F\) lack of fit test. However, from the residual plot, there didn’t seem to be a strong departure from the linear relationship assumption. The failure in the statistical test may be due to the presence of outliers in the data. Let us visualize the full log model along with the log linear model.
<- c("Full log model" = "red", "Log linear model" = "blue")
colors
ggplot(data = auto_data, aes(x = horsepower, y = log(mpg)))+
geom_point(col = "orange")+
geom_line(aes(y = log_full_model$fitted.values, color = "Full log model"))+
geom_line(aes(y = log_model$fitted.values, color = "Log linear model"))+
scale_color_manual(values = colors)+
theme(legend.title = element_blank(),
legend.position = c(0.85,0.9))
We observe that the full log model seems to be affected by outlying observations and is explaining some of the noise. With the addition of more replicates at the \(X\) values corresponding to the outlying observations, the model may have passed the \(F\) lack of fit test. Thus, in this case, it appears that the relationship is indeed linear, but the failure in the statistical test is due to the full model over-fitting the data and explaining some noise as well. Another option is to bin the horsepower variable such that each bin has several replicates, and then do the \(F\) lack of fit test, as sketched below.
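A minimal sketch of that binning option for the log model (20 equal-width bins is an arbitrary choice, and hp_bins_log / log_full_model_binned are illustrative names):
#Bin horsepower so that each level of the full model has several replicates
hp_bins_log <- cut(auto_data$horsepower, breaks = 20)
log_full_model_binned <- lm(log(mpg)~hp_bins_log, data = auto_data)
anova(log_model, log_full_model_binned)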
From the QQ plot, it appears that the error distribution is normal, except for three outlying values. Let us confirm the same from the normality test.
shapiro.test(log_model$residuals)
Shapiro-Wilk normality test
data: log_model$residuals
W = 0.99154, p-value = 0.02458
The \(p\)-value, while still marginally below \(5\%\), is much higher than for the previous models, indicating that the departure from normality is now mild. The \(p\)-value would probably be higher if it were not for those three outlying points. The next step could be to analyze those three points to figure out the reason they are outlying values. For example, those three points may correspond to a particular car model that has only three observations in the data.
4.7 Box-Cox transformation
Sometimes it may be difficult to determine the appropriate transformation from diagnostic plots to correct the skewness of the error terms, unequal error variances, and non-linearity of the regression function. The Box-Cox procedure automatically identifies a transformation from the family of power transformations on \(Y\). The family of power transformations is of the form:
\(Y' = Y^{\lambda}\)
The normal error regression model, with the response variable a member of the family of power transformations, becomes:
\(Y_i^{\lambda} = \beta_0 + \beta_1X_i + \epsilon_i\)
The Box-Cox procedure uses the method of maximum likelihood to estimate \(\lambda\), as well as the other parameters \(\beta_0, \beta_1\), and \(\sigma^2\).
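For reference, with the standard form of the transformation \((Y^{\lambda} - 1)/\lambda\) used by boxcox(), the profile log-likelihood maximized over \(\beta_0, \beta_1, \sigma^2\) for a fixed \(\lambda\) can be written (up to an additive constant) as:
\(\ell(\lambda) = -\frac{n}{2}\log\left(\frac{SSE(\lambda)}{n}\right) + (\lambda - 1)\sum_{i=1}^{n}\log Y_i\),
where \(SSE(\lambda)\) is the residual sum of squares from regressing the transformed response on \(X\). The procedure evaluates this over a grid of \(\lambda\) values and picks the maximizer.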
Let us use the Box-Cox procedure to identify the appropriate transformation for the linear_model.
boxcox(linear_model)
A plot of the log-likelihood vs \(\lambda\) is returned, with a confidence interval for the optimal value of \(\lambda\). We can zoom in on the plot to identify the value of \(\lambda\) at which the log-likelihood is maximized.
boxcox(linear_model, lambda = seq(-1, 0, 0.1))
A value of \(\lambda = -0.5\) seems appropriate. Note that the log-likelihood is based on a sample. Thus, typically we don’t select the exact value of \(\lambda\) at which the log-likelihood is maximized, but any value within the confidence interval that is easier to interpret. Let us consider \(\lambda = -0.5\) in this case.
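If desired, the exact maximizer can also be extracted programmatically from the grid of values returned by boxcox(); a small sketch (bc and lambda_hat are illustrative names):
#boxcox() returns the lambda grid (x) and the corresponding profile log-likelihood (y)
bc <- boxcox(linear_model, lambda = seq(-1, 0, 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat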
boxcox_model <- lm(mpg^-0.5~horsepower, data = auto_data)
par(mfrow=c(2,2))
plot(boxcox_model)
Based on the plots, the model seems to satisfy the normality of errors and homoscedasticity assumption, but there appears to be a quadratic relationship between the response and the predictor. Let us confirm our visual analysis with statistical tests.
bptest(boxcox_model)
studentized Breusch-Pagan test
data: boxcox_model
BP = 10.484, df = 1, p-value = 0.001204
The homoscedasticity assumption is violated, but not too severely.
shapiro.test(boxcox_model$residuals)
Shapiro-Wilk normality test
data: boxcox_model$residuals
W = 0.99449, p-value = 0.1721
The Box-Cox transformation indeed resulted in normally distributed error terms.
power_full_model <- lm(mpg^-0.5~as.factor(horsepower), data = auto_data)
anova(boxcox_model, power_full_model)
Analysis of Variance Table
Model 1: mpg^-0.5 ~ horsepower
Model 2: mpg^-0.5 ~ as.factor(horsepower)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 390 0.152325
2 299 0.085073 91 0.067252 2.5974 6.074e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected based on the visual analysis, the relationship has a large departure from linearity.
Note that the Box-Cox transformation can identify the appropriate transformation for the response, but not for the predictors. Thus, we need to figure out the appropriate predictor transformation based on the diagnostic plots and/or trial and error. Let us try the quadratic transformation of the predictor \(X\) as the relationship appears to be quadratic based on the residual plot.
quadratic_boxcox_model <- lm(mpg^-0.5~horsepower+I(horsepower**2), data = auto_data)
par(mfrow = c(2,2))
plot(quadratic_boxcox_model)
Note that the severity of the linearity assumption violation has been reduced, while the homoscedasticity and normality of errors assumptions seem to be satisfied. Since a transformation of the predictor \(X\) doesn’t change the distribution of the errors, it was expected that the transformation on \(X\) would not deteriorate the model in terms of the assumptions regarding the distribution of the error terms.
Let us confirm our intuition based on diagnostic plots with statistical tests.
bptest(quadratic_boxcox_model)
studentized Breusch-Pagan test
data: quadratic_boxcox_model
BP = 0.96746, df = 2, p-value = 0.6165
As expected, the errors are homoscedastic.
shapiro.test(quadratic_boxcox_model$residuals)
Shapiro-Wilk normality test
data: quadratic_boxcox_model$residuals
W = 0.98943, p-value = 0.006297
There is a slight deviation from the normal distribution, but it is only due to three points, as we saw earlier.
anova(quadratic_boxcox_model, power_full_model)
Analysis of Variance Table
Model 1: mpg^-0.5 ~ horsepower + I(horsepower^2)
Model 2: mpg^-0.5 ~ as.factor(horsepower)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 389 0.138224
2 299 0.085073 90 0.053151 2.0756 2.372e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model fails the linearity assumption test. But as noted earlier, this failure is due to the presence of very few or no replicates at some points. This potentially results in the full model explaining the noise, which in turn results in the model failing the linearity test. Again, we can bin the predictor values to have replicates for every bin, and then the model is likely to satisfy the linearity assumption.
As shown in the plot below, the full model is overfitting the data, and explaining some noise due to the absence of replicates for some values of horsepower.
ggplot(data = auto_data, aes(x = horsepower, y = mpg))+
geom_point()+
geom_line(aes(y = power_full_model$fitted.values^-2), color = "blue")
To resolve this issue, we will bin the predictor values so that we have replicates within each bin, and get a better fit for the full model.
#Binning horsepower into 20 equal width bins
horsepower_bins <- cut(auto_data$horsepower, breaks = 20)
# Full model based on binned horsepower
power_full_model_binned_hp <- lm(mpg^-0.5~horsepower_bins, data = auto_data)
anova(quadratic_boxcox_model, power_full_model_binned_hp)
Analysis of Variance Table
Model 1: mpg^-0.5 ~ horsepower + I(horsepower^2)
Model 2: mpg^-0.5 ~ horsepower_bins
Res.Df RSS Df Sum of Sq F Pr(>F)
1 389 0.13822
2 372 0.12912 17 0.0091085 1.5437 0.07696 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that now the quadratic_boxcox_model model satisfies the linearity assumption (assuming a significance level of 5%).
Our full model based on the binned horsepower values also seems to fit the data better, as shown in the plot below.
<- c("Full model (binned predictor)" = "red", "Quadratic Box-Cox model" = "blue")
colors ggplot(data = auto_data, aes(x = horsepower, y = mpg))+
geom_point()+
geom_line(aes(y = quadratic_boxcox_model$fitted.values^-2, color = "Quadratic Box-Cox model"))+
geom_line(aes(y = power_full_model_binned_hp$fitted.values^-2, color = "Full model (binned predictor)"))+
scale_color_manual(values = colors)+
theme(legend.title = element_blank(),
legend.position = c(0.75,0.85))