6  Multicollinearity

Collinearity refers to the situation in which two or more predictor variables have a high linear association. The linear association between a pair of variables can be measured by the correlation coefficient; thus, the correlation matrix can indicate some potential collinearity problems.
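
As a quick illustration (a minimal sketch on simulated data; the variable names z1, z2, z3 are made up for this example), cor() returns the pairwise correlation matrix, and a large off-diagonal entry flags a potentially collinear pair:

# Simulated predictors: z2 is constructed to be nearly collinear with z1
set.seed(1)
z1 <- rnorm(100)
z2 <- z1 + rnorm(100, sd = 0.1)   # nearly a copy of z1
z3 <- rnorm(100)                  # unrelated to z1 and z2
round(cor(cbind(z1, z2, z3)), 2)  # the z1-z2 entry will be close to 1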

6.1 Why and how is collinearity a problem?

(Source: pages 100-101 of ISLR)

The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response.

Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for \(\hat \beta_j\) to grow. Recall that the t-statistic for each predictor is calculated by dividing \(\hat \beta_j\) by its standard error. Consequently, collinearity results in a decline in the \(t\)-statistic. As a result, in the presence of collinearity, we may fail to reject \(H_0: \beta_j = 0\). This means that the power of the hypothesis test—the probability of correctly detecting a non-zero coefficient—is reduced by collinearity.
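
To see this chain of effects concretely, here is a minimal simulated sketch (the names u1, u2_collinear, u2_unrelated, y_sim and the noise levels are arbitrary choices for illustration): the same predictor is fit once alongside a nearly collinear companion and once alongside an unrelated one, and its standard error and t-statistic can be compared across the two summaries.

# Compare the standard error of u1's coefficient with and without a collinear companion
set.seed(10)
n <- 200
u1 <- rnorm(n)
u2_collinear <- u1 + rnorm(n, sd = 0.05)   # nearly a copy of u1
u2_unrelated <- rnorm(n)                   # independent of u1
y_sim <- 1 + 2*u1 + rnorm(n)
summary(lm(y_sim ~ u1 + u2_collinear))     # SE of u1 is inflated; its t-statistic shrinks
summary(lm(y_sim ~ u1 + u2_unrelated))     # SE of u1 is much smaller here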

6.2 How to measure collinearity/multicollinearity?

(Source: page 102 of ISLR)

Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity. Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF). The VIF is the ratio of the variance of \(\hat \beta_j\) when fitting the full model to the variance of \(\hat \beta_j\) if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

The estimated variance of \(\hat{\beta}_j\), the coefficient estimate for the \(j^{th}\) predictor \(X_j\), can be expressed as:

\[\widehat{\text{var}}(\hat{\beta}_j) = \frac{\hat{\sigma}^2}{(n-1)\,\widehat{\text{var}}(X_j)} \cdot \frac{1}{1-R^2_{X_j|X_{-j}}},\]

where \(R^2_{X_j|X_{-j}}\) is the \(R\)-squared for the regression of \(X_j\) on the other covariates (a regression that does not involve the response variable \(Y\)).

In the case of simple linear regression, the variance expression above does not contain the term \(\frac{1}{1-R^2_{X_j|X_{-j}}}\), as there is only one predictor. In multiple linear regression, however, the variance of the estimated \(j^{th}\) coefficient \(\hat{\beta}_j\) gets inflated by a factor of \(\frac{1}{1-R^2_{X_j|X_{-j}}}\). (Note that in the complete absence of collinearity, \(R^2_{X_j|X_{-j}}=0\), and this factor equals 1.)

Thus, the variance inflation factor (VIF) for the estimated coefficient of the \(j^{th}\) predictor \(X_j\) is:

VIF\((\hat \beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}\)

Let us consider the dataset Credit.csv. We’ll compute the VIF of the predictors when predicting the credit card balance of customers.

# Loading library for VIF
library(car)
# Reading data
credit_data <- read.csv('./Datasets/Credit.csv')
head(credit_data)
   Income Limit Rating Cards Age Education Own Student Married Region Balance
1  14.891  3606    283     2  34        11  No      No     Yes  South     333
2 106.025  6645    483     3  82        15 Yes     Yes     Yes   West     903
3 104.593  7075    514     4  71        11  No      No      No   West     580
4 148.924  9504    681     3  36        11 Yes      No      No   West     964
5  55.882  4897    357     2  68        16  No      No     Yes  South     331
6  80.180  8047    569     4  77        10  No      No      No  South    1151
# Developing model
model <- lm(Balance~Limit+Rating+Cards+Age, data = credit_data)
summary(model)

Call:
lm(formula = Balance ~ Limit + Rating + Cards + Age, data = credit_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-710.90 -139.59  -13.04  133.87  829.10 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -270.37864   55.78092  -4.847  1.8e-06 ***
Limit          0.11229    0.07458   1.506 0.132993    
Rating         0.91305    1.11425   0.819 0.413034    
Cards         22.85955    9.92792   2.303 0.021823 *  
Age           -2.38988    0.66529  -3.592 0.000369 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 227.8 on 395 degrees of freedom
Multiple R-squared:  0.7569,    Adjusted R-squared:  0.7544 
F-statistic: 307.4 on 4 and 395 DF,  p-value: < 2.2e-16

Note that the coefficients of Limit and Rating are statistically insignificant at the 5% significance level.

# Computing VIF
vif(model)
     Limit     Rating      Cards        Age 
227.787786 228.437806   1.424475   1.012225 

Note that the VIFs of Limit and Rating are very high, indicating the presence of multicollinearity. This may have led to high standard errors for the coefficients of these predictors, thereby making them appear statistically insignificant.
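
The VIF reported by vif() can be reproduced directly from its definition above: regress the predictor of interest on the remaining predictors and compute \(1/(1-R^2_{X_j|X_{-j}})\). A minimal check for Limit (code only; it should closely match the value reported above):

# Reproducing the VIF of Limit from its definition: regress Limit on the other
# predictors (the response is not involved) and compute 1 / (1 - R^2)
r2_limit <- summary(lm(Limit ~ Rating + Cards + Age, data = credit_data))$r.squared
1 / (1 - r2_limit)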

Let us remove Rating from the model as it seems to be redundant based on its high VIF.
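
Before doing so, we can also check the pairwise correlation between the two high-VIF predictors; a correlation close to 1 would confirm that one of them is largely redundant (code only; output not shown here):

# Pairwise correlation between the two predictors flagged by their high VIFs
cor(credit_data$Limit, credit_data$Rating)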

model <- lm(Balance~Limit+Cards+Age, data = credit_data)
summary(model)

Call:
lm(formula = Balance ~ Limit + Cards + Age, data = credit_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-698.13 -135.29  -14.94  133.38  809.35 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.486e+02  4.899e+01  -5.074 6.01e-07 ***
Limit        1.733e-01  4.965e-03  34.897  < 2e-16 ***
Cards        2.729e+01  8.323e+00   3.279  0.00113 ** 
Age         -2.383e+00  6.650e-01  -3.584  0.00038 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 227.8 on 396 degrees of freedom
Multiple R-squared:  0.7565,    Adjusted R-squared:  0.7546 
F-statistic:   410 on 3 and 396 DF,  p-value: < 2.2e-16

Note that the coefficient of Limit is now highly statistically significant. This appears to be due to the sharp reduction in its standard error, which in turn results from the reduction in multicollinearity: the standard error of Limit dropped by a factor of about 15 (from 0.0746 to 0.0050), which is close to \(\sqrt{227.8} \approx 15.1\), the square root of its VIF in the previous model.

Let us re-check the VIF of the predictors of the updated model to see if multicollinearity has indeed been reduced.

vif(model)
   Limit    Cards      Age 
1.010319 1.001883 1.012080 

The model no longer suffers from multicollinearity. This example shows how we may draw incorrect inferences in the presence of multicollinearity.

Note that the \(R^2_{adj}\) of the updated model is almost the same as that of the previous model. Thus, the goodness-of-fit of the model seems to be unaffected by multicollinearity. However, this need not always be the case.

6.3 Perfect multicollinearity

The vif() function fails to compute the variance inflation factor in case of perfect multicollinearity. In such a scenario, one would like to identify the perfectly multicollinear relationship, and remove the associated predictor(s) to eliminate perfect multicollinearity. The alias() function can be used to identify such relationships. Consider the simulated example below.

Let x1,…,x5 be predictors, and y be the response.

# Simulating the data
x1 <- seq(0, 1, 0.01)
x2 <- runif(101)
x3 <- x1*2 - 0.5*x2
x4 <- rnorm(101)
x5 <- rnorm(101)
y <- x1 + x3^2 + rnorm(101)

From the simulation above, we can see that x1, x2, and x3 are perfectly multicollinear, since x3 is an exact linear combination of x1 and x2. Let us try to find the VIF of the predictors.

model <- lm(y~x1+x2+x3+x4+x5)
vif(model)

On executing the above code, we obtain the error "there are aliased coefficients in the model", which indicates the presence of perfect multicollinearity. The perfectly multicollinear relationship can be identified using the alias() function as shown below.

alias(lm(y~x1+x2+x3+x4+x5))
Model :
y ~ x1 + x2 + x3 + x4 + x5

Complete :
   (Intercept) x1   x2   x4   x5  
x3    0           2 -1/2    0    0

The above result shows that \(x_3 = 2x_1 - 0.5x_2\). Thus, we can remove any one of \(x_1, x_2, x_3\) from the model to eliminate perfect multicollinearity.
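
We can double-check this recovered relationship directly against the simulated predictors (a quick verification, not part of the original workflow):

# Confirm the aliased relationship reported by alias()
all.equal(x3, 2*x1 - 0.5*x2)   # should be TRUE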

cor(cbind(x1,x2,x3,y))
            x1          x2         x3          y
x1  1.00000000 -0.07112297  0.9703358  0.7532907
x2 -0.07112297  1.00000000 -0.3101619 -0.2668419
x3  0.97033583 -0.31016190  1.0000000  0.7826352
y   0.75329073 -0.26684195  0.7826352  1.0000000

Let us remove x1, as it is far more highly correlated with x3 than x2 is, so removing x1 will reduce multicollinearity more than removing x2 would. Moreover, between x1 and x3, x3 has the higher correlation with y, so it is better to keep x3 and remove x1.

model <- lm(y~x2+x3+x4+x5)
vif(model)
      x2       x3       x4       x5 
1.112431 1.128356 1.044870 1.027989 

6.4 Addressing multicollinearity in polynomial models

In polynomial models, it is likely that \(X\) will be highly correlated with \(X^p\), resulting in imprecise estimates of the regression coefficients. Let us consider the following simulation example.

# Simulating data
X <- seq(5,8,0.01)
y <- X + X^2 + rnorm(length(X))

model <- lm(y~X+I(X^2))
summary(model)

Call:
lm(formula = y ~ X + I(X^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-2.88147 -0.66560 -0.00038  0.68735  2.91196 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.96675    3.44245   0.862    0.389    
X            0.03921    1.07235   0.037    0.971    
I(X^2)       1.07452    0.08234  13.049   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9647 on 298 degrees of freedom
Multiple R-squared:  0.9938,    Adjusted R-squared:  0.9938 
F-statistic: 2.404e+04 on 2 and 298 DF,  p-value: < 2.2e-16

Note that \(X\) appears to be statistically insignificant in the model, even though it is part of the true model. This may be due to the presence of multicollinearity. Let us compute the correlation between \(X\) and \(X^2\).

cor(X, X^2)
[1] 0.9982179

The high correlation indicates the presence of multicollinearity. This can also be seen from the high VIF values shown below.

vif(model)
       X   I(X^2) 
280.8106 280.8106 

Centering the predictor may help reduce multicollinearity, as shown below.

x <- X-mean(X)
cor(x,x^2)
[1] 9.0533e-17

After centering the predictor, the correlation between the linear and quadratic terms (\(x\) and \(x^2\)) is almost 0.

model_centered <- lm(y~x+I(x^2))
vif(model_centered)
     x I(x^2) 
     1      1 

Note that multicollinearity has been eliminated in the model with the centered predictor.

Also, the coefficients of both the linear and quadratic terms are statistically significant (see the model summary below), as they should be according to the true model.

summary(model_centered)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-2.88147 -0.66560 -0.00038  0.68735  2.91196 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 48.62000    0.08341  582.93   <2e-16 ***
x           14.00794    0.06399  218.90   <2e-16 ***
I(x^2)       1.07452    0.08234   13.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9647 on 298 degrees of freedom
Multiple R-squared:  0.9938,    Adjusted R-squared:  0.9938 
F-statistic: 2.404e+04 on 2 and 298 DF,  p-value: < 2.2e-16
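
As an alternative to manual centering (a sketch, not a step in the original example), R's poly() function generates orthogonal polynomial columns that are uncorrelated by construction. The individual coefficients are on a different, orthogonalized scale, but the significance of the linear and quadratic contributions can be read off directly:

# Orthogonal polynomial terms: the columns of poly(X, 2) are uncorrelated by construction
P <- poly(X, 2)
cor(P[, 1], P[, 2])              # essentially zero
model_poly <- lm(y ~ poly(X, 2))
summary(model_poly)              # both the linear and quadratic terms should be significant here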

6.5 When can we overlook multicollinearity?

  • The severity of the problems increases with the degree of the multicollinearity. Therefore, if there is only moderate multicollinearity (5 < VIF < 10), we may overlook it.

  • Multicollinearity affects only the standard errors of the coefficients of collinear predictors. Therefore, if multicollinearity is not present for the predictors that we are particularly interested in, we may not need to resolve it.

  • Multicollinearity affects the standard errors of the coefficients and thereby their \(p\)-values, but in general, it does not influence the prediction accuracy, except when the coefficients are so unstable that the predictions fall outside the domain of the response. If our sole aim is prediction, and we don’t wish to infer the statistical significance of predictors, then we may avoid addressing multicollinearity (see the sketch after this list). “The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations, provided these inferences are made within the region of observations” - Neter, John, Michael H. Kutner, Christopher J. Nachtsheim, and William Wasserman. “Applied linear statistical models.” (1996): 318.
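
To illustrate the point about prediction accuracy, the sketch below re-fits the two Credit models from earlier (it assumes credit_data is still in the workspace) and compares their fitted values; the predictions should be very similar even though the coefficient estimates and standard errors differ:

# Predictions from the model with collinear predictors vs. the reduced model
model_full    <- lm(Balance ~ Limit + Rating + Cards + Age, data = credit_data)
model_reduced <- lm(Balance ~ Limit + Cards + Age, data = credit_data)
cor(fitted(model_full), fitted(model_reduced))            # expected to be very close to 1
summary(abs(fitted(model_full) - fitted(model_reduced)))  # differences are small relative to Balance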