# Loading library for VIF
library(car)
6 Multicollinearity
Collinearity refers to the situation when two or more predictor variables have a high linear association. Linear association between a pair of variables can be measured by the correlation coefficient. Thus the correlation matrix can indicate some potential collinearity problems.
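For instance, a quick look at the pairwise correlations of the predictors can already flag a collinear pair. The snippet below is a minimal illustration with simulated data (not part of the example that follows); the variable names u1, u2, u3 are hypothetical.
# Hypothetical illustration: a correlation matrix flags strong pairwise association
set.seed(1)
u1 <- rnorm(100)
u2 <- 2*u1 + rnorm(100, sd = 0.1)   # nearly a linear function of u1
u3 <- rnorm(100)                    # unrelated to u1 and u2
round(cor(cbind(u1, u2, u3)), 2)    # cor(u1, u2) will be close to 1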
6.1 Why and how is collinearity a problem?
(Source: page 100-101 of ISLR)
The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response.
Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for \(\hat \beta_j\) to grow. Recall that the t-statistic for each predictor is calculated by dividing \(\hat \beta_j\) by its standard error. Consequently, collinearity results in a decline in the \(t\)-statistic. As a result, in the presence of collinearity, we may fail to reject \(H_0: \beta_j = 0\). This means that the power of the hypothesis test—the probability of correctly detecting a non-zero coefficient—is reduced by collinearity.
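The effect on the standard errors can be seen in a small simulation (a hedged sketch, not part of the original example; the names w1, w2, w3 and yy are made up for illustration): the same coefficient is estimated far less precisely when its predictor is paired with a nearly collinear one.
# Sketch: collinearity inflates the standard errors of the affected coefficients
set.seed(1)
n  <- 200
w1 <- rnorm(n)
w2 <- w1 + rnorm(n, sd = 0.1)       # w2 is nearly collinear with w1
w3 <- rnorm(n)                      # w3 is independent of w1
yy <- w1 + w3 + rnorm(n)
summary(lm(yy ~ w1 + w2))$coefficients[, "Std. Error"]   # inflated SEs for w1 and w2
summary(lm(yy ~ w1 + w3))$coefficients[, "Std. Error"]   # much smaller SEs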
6.2 How to measure collinearity/multicollinearity?
(Source: page 102 of ISLR)
Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity. Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF). The VIF is the ratio of the variance of \(\hat \beta_j\) when fitting the full model divided by the variance of \(\hat \beta_j\) if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
The estimated variance of the coefficient \(\beta_j\), of the \(j^{th}\) predictor \(X_j\), can be expressed as:
\[\hat{var}(\hat{\beta_j}) = \frac{\hat{\sigma}^2}{(n-1)\hat{var}({X_j})}\cdot\frac{1}{1-R^2_{X_j|X_{-j}}},\]
where \(R^2_{X_j|X_{-j}}\) is the \(R\)-squared for the regression of \(X_j\) on the other covariates (a regression that does not involve the response variable \(Y\)).
In case of simple linear regression, the variance expression in the equation above does not contain the term \(\frac{1}{1-R^2_{X_j|X_{-j}}}\), as there is only one predictor. However, in case of multiple linear regression, the variance of the estimate of the \(j^{th}\) coefficient (\(\hat{\beta_j}\)) gets inflated by a factor of \(\frac{1}{1-R^2_{X_j|X_{-j}}}\) (Note that in the complete absence of collinearity, \(R^2_{X_j|X_{-j}}=0\), and the value of this factor will be 1).
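For reference, in simple linear regression the corresponding expression is just the first factor:
\[\hat{var}(\hat{\beta_1}) = \frac{\hat{\sigma}^2}{(n-1)\hat{var}({X_1})}.\]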
Thus, the variance inflation factor (VIF) for the estimated coefficient of the \(j^{th}\) predictor \(X_j\) is:
\[\text{VIF}(\hat \beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}.\]
Let us consider the dataset Credit.csv. We’ll compute VIF of the predictors when predicting the credit card balance of customers.
# Reading data
credit_data <- read.csv('./Datasets/Credit.csv')
head(credit_data)
Income Limit Rating Cards Age Education Own Student Married Region Balance
1 14.891 3606 283 2 34 11 No No Yes South 333
2 106.025 6645 483 3 82 15 Yes Yes Yes West 903
3 104.593 7075 514 4 71 11 No No No West 580
4 148.924 9504 681 3 36 11 Yes No No West 964
5 55.882 4897 357 2 68 16 No No Yes South 331
6 80.180 8047 569 4 77 10 No No No South 1151
# Developing model
model <- lm(Balance~Limit+Rating+Cards+Age, data = credit_data)
summary(model)
Call:
lm(formula = Balance ~ Limit + Rating + Cards + Age, data = credit_data)
Residuals:
Min 1Q Median 3Q Max
-710.90 -139.59 -13.04 133.87 829.10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -270.37864 55.78092 -4.847 1.8e-06 ***
Limit 0.11229 0.07458 1.506 0.132993
Rating 0.91305 1.11425 0.819 0.413034
Cards 22.85955 9.92792 2.303 0.021823 *
Age -2.38988 0.66529 -3.592 0.000369 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 227.8 on 395 degrees of freedom
Multiple R-squared: 0.7569, Adjusted R-squared: 0.7544
F-statistic: 307.4 on 4 and 395 DF, p-value: < 2.2e-16
Note that the coefficients of Limit and Rating are statistically insignificant at the 5% significance level.
# Computing VIF
vif(model)
Limit Rating Cards Age
227.787786 228.437806 1.424475 1.012225
Note that the VIF of Limit and Rating is very high, indicating the presence of multicollinearity. This may have led to high standard errors for the coefficients of these predictors, thereby making them appear statistically insignificant.
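As a sanity check, the VIF of Limit can also be computed directly from its definition, by regressing Limit on the remaining predictors (a quick verification, assuming credit_data is loaded as above):
# VIF(Limit) = 1 / (1 - R^2 from regressing Limit on the other predictors)
r2_limit <- summary(lm(Limit ~ Rating + Cards + Age, data = credit_data))$r.squared
1/(1 - r2_limit)    # should match the value reported by vif()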
Let us remove Rating from the model as it seems to be redundant based on its high VIF.
model <- lm(Balance~Limit+Cards+Age, data = credit_data)
summary(model)
Call:
lm(formula = Balance ~ Limit + Cards + Age, data = credit_data)
Residuals:
Min 1Q Median 3Q Max
-698.13 -135.29 -14.94 133.38 809.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.486e+02 4.899e+01 -5.074 6.01e-07 ***
Limit 1.733e-01 4.965e-03 34.897 < 2e-16 ***
Cards 2.729e+01 8.323e+00 3.279 0.00113 **
Age -2.383e+00 6.650e-01 -3.584 0.00038 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 227.8 on 396 degrees of freedom
Multiple R-squared: 0.7565, Adjusted R-squared: 0.7546
F-statistic: 410 on 3 and 396 DF, p-value: < 2.2e-16
Note that now the coefficient of Limit is highly statistically significant. This seems to be due to the reduction in the standard error of the coefficient, which is potentially due to the reduction of multicollinearity in the model.
Let us re-check the VIF of the predictors of the updated model to see if multicollinearity has indeed been reduced.
vif(model)
Limit Cards Age
1.010319 1.001883 1.012080
The model no longer suffers from multicollinearity. This example shows that we may draw incorrect inferences in the presence of multicollinearity.
Note that the \(R^2_{adj}\) of the updated model is almost the same as that of the previous model. Thus, the goodness-of-fit of the model seems to be unaffected by multicollinearity. However, this need not always be the case.
6.3 Perfect multicollinearity
The vif() function fails to compute the variance inflation factor in the case of perfect multicollinearity. In such a scenario, one would like to identify the perfectly multicollinear relationship and remove the associated predictor(s) to eliminate perfect multicollinearity. The alias() function can be used to identify such relationships. Consider the simulated example below.
Let x1, …, x5 be the predictors, and y be the response.
# Simulating the data
x1 = seq(0,1,0.01)
x2 = runif(101)
x3 = x1*2 - 0.5*x2
x4 = rnorm(101)
x5 = rnorm(101)
y = x1 + x3**2 + rnorm(101)
From the above simulated data, we can see that x1, x2, and x3 are perfectly multicollinear. Let us try to find the VIF of the predictors.
model <- lm(y~x1+x2+x3+x4+x5)
vif(model)
On executing the above code, we obtain the error: there are aliased coefficients in the model. This indicates the presence of perfect multicollinearity. The perfectly multicollinear relationship can be identified using the alias() function, as shown below.
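To see the error message without stopping the script, the vif() call can be wrapped in tryCatch() (a small illustrative sketch):
# Capture and print the error raised by vif() under perfect multicollinearity
tryCatch(vif(lm(y ~ x1 + x2 + x3 + x4 + x5)),
         error = function(e) conditionMessage(e))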
alias(lm(y~x1+x2+x3+x4+x5))
Model :
y ~ x1 + x2 + x3 + x4 + x5
Complete :
(Intercept) x1 x2 x4 x5
x3 0 2 -1/2 0 0
The above result shows that \(x_3 = 2x_1-0.5x_2\). Thus, we can remove any one of \(x_1, x_2, x_3\) from the model to eliminate perfect multicollinearity.
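We can confirm this relationship numerically (a quick check using the simulated vectors above):
# Verify that x3 is an exact linear combination of x1 and x2
all.equal(x3, 2*x1 - 0.5*x2)
To decide which of the three predictors to drop, we can inspect their pairwise correlations with each other and with y: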
cor(cbind(x1,x2,x3,y))
x1 x2 x3 y
x1 1.00000000 -0.07112297 0.9703358 0.7532907
x2 -0.07112297 1.00000000 -0.3101619 -0.2668419
x3 0.97033583 -0.31016190 1.0000000 0.7826352
y 0.75329073 -0.26684195 0.7826352 1.0000000
Let us remove x1 as it is highly correlated with x3. Removing x1 will reduce multicollinearity more than removing x2. Among x1 and x3, x3 has a higher correlation with y, so it is better to remove x1.
model <- lm(y~x2+x3+x4+x5)
vif(model)
x2 x3 x4 x5
1.112431 1.128356 1.044870 1.027989
6.4 Addressing multicollinearity in polynomial models
In polynomial models, it is likely that \(X\) will be highly correlated with \(X^p\), resulting in imprecise estimates of the regression coefficients. Let us consider the following simulation example.
# Simulating data
X <- seq(5,8,0.01)
y <- X + X^2 + rnorm(length(X))

model <- lm(y~X+I(X^2))
summary(model)
Call:
lm(formula = y ~ X + I(X^2))
Residuals:
Min 1Q Median 3Q Max
-2.88147 -0.66560 -0.00038 0.68735 2.91196
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.96675 3.44245 0.862 0.389
X 0.03921 1.07235 0.037 0.971
I(X^2) 1.07452 0.08234 13.049 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9647 on 298 degrees of freedom
Multiple R-squared: 0.9938, Adjusted R-squared: 0.9938
F-statistic: 2.404e+04 on 2 and 298 DF, p-value: < 2.2e-16
Note that \(X\) appears to be statistically insignificant in the model. This may be due to the presence of multicollinearity. Let us compute the correlation between \(X\) and \(X^2\).
cor(X, X^2)
[1] 0.9982179
The high correlation indicates the presence of multicollinearity. Multicollinearity can also be seen in the high VIF values shown below.
vif(model)
X I(X^2)
280.8106 280.8106
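Since there are only two predictors here, the VIF can also be obtained directly from their correlation (a quick check based on the definition above):
# With two predictors, the R^2 of X regressed on X^2 equals cor(X, X^2)^2
1/(1 - cor(X, X^2)^2)    # should match the VIF of roughly 280 reported above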
Centering the predictor may help reduce multicollinearity, as shown below.
x <- X - mean(X)
cor(x, x^2)
[1] 9.0533e-17
The correlation between \(X\) and \(X^2\), after centering the predictor, is almost 0.
model_centered <- lm(y~x+I(x^2))
vif(model_centered)
x I(x^2)
1 1
Note that multicollinearity has been eliminated in the model with the centered predictor.
Also, the coefficients of both \(X\) and \(X^2\) are statistically significant (see the model summary below), as they should be as per the true model.
summary(model_centered)
Call:
lm(formula = y ~ x + I(x^2))
Residuals:
Min 1Q Median 3Q Max
-2.88147 -0.66560 -0.00038 0.68735 2.91196
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.62000 0.08341 582.93 <2e-16 ***
x 14.00794 0.06399 218.90 <2e-16 ***
I(x^2) 1.07452 0.08234 13.05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9647 on 298 degrees of freedom
Multiple R-squared: 0.9938, Adjusted R-squared: 0.9938
F-statistic: 2.404e+04 on 2 and 298 DF, p-value: < 2.2e-16
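Note that centering only re-parameterizes the model; the fitted values (and hence the predictions) are unchanged. A quick check, using model and model_centered from above:
# Centering is a linear re-parameterization, so the fitted values coincide
all.equal(fitted(model), fitted(model_centered))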
6.5 When can we overlook multicollinearity?
The severity of these problems increases with the degree of multicollinearity. Therefore, if there is only moderate multicollinearity (5 < VIF < 10), we may overlook it.
Multicollinearity affects only the standard errors of the coefficients of collinear predictors. Therefore, if multicollinearity is not present for the predictors that we are particularly interested in, we may not need to resolve it.
Multicollinearity affects the standard error of the coefficients and thereby their \(p\)-values, but in general, it does not influence the prediction accuracy, except in the case that the coefficients are so unstable that the predictions are outside of the domain space of the response. If our sole aim is prediction, and we don’t wish to infer the statistical significance of predictors, then we may avoid addressing multicollinearity. “The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations, provided these inferences are made within the region of observations” - Neter, John, Michael H. Kutner, Christopher J. Nachtsheim, and William Wasserman. “Applied linear statistical models.” (1996): 318.
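As an illustration of the last point, the fitted Balance values from the collinear Credit model (with both Limit and Rating) and the reduced model are nearly indistinguishable. The sketch below is a hedged check, assuming credit_data from Section 6.2 is still loaded; model_full and model_reduced are names introduced here for clarity.
# Refit both Credit models and compare their fits
model_full    <- lm(Balance ~ Limit + Rating + Cards + Age, data = credit_data)
model_reduced <- lm(Balance ~ Limit + Cards + Age, data = credit_data)
cor(fitted(model_full), fitted(model_reduced))    # expected to be very close to 1
c(full = summary(model_full)$sigma, reduced = summary(model_reduced)$sigma)    # similar residual standard errors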