# Importing libraries
rm(list=ls())
library(ISLR)
library(leaps)
library(car)
9 Outliers, High leverage and Influential Points
Let us again consider the Hitters
dataset to identify outliers, high leverage, and influential points. Recall the meaning of these terms:
1. Outliers: These are outlying observations with respect to the response (\(Y\)). If there is a relatively large gap between the prediction and the observed response, the observation may be an outlier. Or, in other words, observations that correspond to large residuals are likely to be outliers.
2. High Leverage points: These are outlying observations with respect to the predictors (\(X\)). If there is a relatively large gap between an observation and the center of the rest of the observations in terms of predictor values, then the observation has a high leverage in influencing the regression function. However, it may or may not influence it. It will influence it only if the observation is an outlier as well, i.e., outlying observation with respect to the response (\(Y\)) as well.
3. Influential points: Observations that are both outliers and having high leverage tend to influence the regression function. Such observations should be analyzed, and a decision should be made whether they should be included in model fitting or not. If they appear to be incorrect, then they may be discarded.
We will read the Hitters
dataset, and use the best model based on the \(BIC\) criterion as identified in the previous chapter.
# Reading data
= Hitters
data
# Removing missing values
<- data[complete.cases(data),]
data
# Identifying the predictors of the best model based on on the BIC criterion
<- regsubsets(Salary ~ ., data = Hitters,
regfit.full nvmax = 19)
<- summary(regfit.full)
reg.summary coef(regfit.full, which.min(reg.summary$bic))
(Intercept) AtBat Hits Walks CRBI DivisionW
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
PutOuts
0.2643076
Let us fit the best model based on the above predictors.
# Fitting the best model
<- lm(Salary~AtBat+Hits+Walks+CRBI+Division+PutOuts, data = data) model
9.1 Outliers
Let us visualize the deleted studentized residuals to identify outliers.
# Number of observations
= nrow(data)
n
# Number of predictors
= 7 # including the intercept p
# Studentized deleted residuals
plot(1:n, rstudent(model), type = "l")
text(1:n, rstudent(model))
We’ll use the Bonferroni test to identify the value of the studentized deleted residual over which an observation will be considered an outlier
# Critical value for the residual to be identified as an outlier
<- 0.05
alpha <- qt(1-alpha/2/n, n-p-1) #Bonferroni
crit which(abs(rstudent(model)) >=crit )
-Mike Schmidt
173
We have only one outlying observation, which is the \(173rd\) observation in the data (also shown below).
# Outliers
plot(1:n, rstudent(model), type = "l",
ylab = "Deleted studentized residual", xlab = "Index")
text(1:n, rstudent(model))
abline(h=crit, col = 'red')
Let us analyze this observation.
173,] data[
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
-Mike Schmidt 20 1 0 0 0 0 2 41 9 2 6
CRBI CWalks League Division PutOuts Assists Errors Salary
-Mike Schmidt 7 4 N E 78 220 6 2127.333
NewLeague
-Mike Schmidt N
This observation is for Mike Schmidt whose salary is $2127. However, his predicted salary based on the model is $87 (shown below):
predict(model, newdata = data[173,])
-Mike Schmidt
86.86152
The observation is an outlier because players similar to Mike Schmidt, in terms of AtBat
, Hits
, Walks
, CRBI
, Division
, and PutOuts
have a low salary. Let us see the salary of players similar to Mike Schmidt.
<- data[,c('AtBat','Hits','Walks','CRBI','PutOuts')]
std_data <- (std_data-apply(std_data,2,mean))/apply(std_data,2,sd)
std_data <- sqrt(apply(((std_data[rep(173,n),c('AtBat','Hits','Walks','CRBI','PutOuts')] - std_data[,c('AtBat','Hits','Walks','CRBI','PutOuts')])^2),1,sum))
distance <- cbind(data, distance, predicted_salary=model$fitted.values)
data_distance order(by = distance),][1:5,c('AtBat','Hits','Walks','CRBI','PutOuts','Salary','predicted_salary')] data_distance[
AtBat Hits Walks CRBI PutOuts Salary predicted_salary
-Mike Schmidt 20 1 0 7 78 2127.333 86.86152
-Bruce Bochy 127 32 14 82 202 135.000 132.47474
-Dale Sveum 317 78 32 35 45 70.000 245.03616
-Curt Ford 214 53 23 32 109 70.000 229.09872
-Kurt Stillwell 279 64 30 26 107 75.000 89.83407
We can see salary of players having the most similar statistics to Mike Schmidt is much lower than him.
Let us see the players for which the predicted salary is more than $1500
cbind(data[model$fitted.values>1500,c('AtBat','Hits','Walks','CRBI','PutOuts','Salary')],predicted_salary = model$fitted.values[model$fitted.values>1500])
AtBat Hits Walks CRBI PutOuts Salary predicted_salary
-Eddie Murray 495 151 78 1015 1045 2460.0 1532.104
-Jim Rice 618 200 62 1289 330 2412.5 1602.928
-Keith Hernandez 551 171 94 900 1199 1800.0 1605.470
Note that these players have statistics much different than Mike Schmidt. Thus, Mike Schmidt is an outlier with respect to the response.
Note the model coefficients to see the effect of statistics on predicted salary.
$coefficients model
(Intercept) AtBat Hits Walks CRBI DivisionW
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
PutOuts
0.2643076
The outlying observation can also be seen clearly if we plot the fitted values against salary.
plot(data$Salary, model$fitted.values, col = 'orange', pch = 16, main = "Fitted vs true values")
points(data$Salary[173], model$fitted.values[173], col = 'red', pch = 19)
text(data$Salary[173], model$fitted.values[173], "Mike Schmidt", pos = 3)
abline(0,1, col = 'blue')
9.2 Leverage
Let us visualize the leverage of all observations.
# Leverage
plot(1:n, hatvalues(model), type = "l")
text(1:n, hatvalues(model))
# Average leverage
/n p
[1] 0.02661597
Observations whose leverage is more than twice as large as the mean leverage can be considered as high leverage.
sum(hatvalues(model)>2*p/n)
[1] 24
We have 24 observations that can be considered to have a high leverage (visualized below).
# High leverage
plot(1:n, hatvalues(model), type = "l")
text(1:n, hatvalues(model))
abline(h=2*p/n, col = 'red')
9.3 Influential observations
Observations that have a high leverage and are outliers are influential. The degree of influence depends on the leverage and the magnitude of the residual.
9.3.1 Cook’s distance
Let us visualize influential observations based on Cook’s distance.
plot(1:n, cooks.distance(model), type = "l")
text(1:n, cooks.distance(model))
If Cook’s distance for an observation is at 50th or higher percentile of \(F(p,n-p)\) distribution, then the observation is highly influential, while if it is less than 20th percentile, then it is not influential.
#50th percentile value of F(p,n-p) distribution
qf(0.5, 7, n-7)
[1] 0.9089314
#20th percentile value of F(p,n-p) distribution
qf(0.2, 7, n-7)
[1] 0.5448048
As the highest Cook’s distance in our observations is 0.2, we don’t have any influential observations. However, the 173rd observation is relatively more influential than the rest.
As the influence of an observation depends on its residual, and leverage, a plot of residual against leverage also indicates influence of observations. In the plot below, the size of the points is proportional to the influence.
plot(hatvalues(model), rstudent(model), cex = 10*cooks.distance(model), ylab = "Deleted studentized residual", xlab = "Leverage")
abline(h=0,col=2)
We can see that the observation with the deleted studentized residual of more than 6 is the most influential. Let us compare the model fit with and without this observation.
<- lm(Salary~AtBat+Hits+Walks+CRBI+Division+PutOuts, data = data[-173,]) model_minus173
# Original model
summary(model)$r.squared
[1] 0.5087146
summary(model)$adj.r.squared
[1] 0.4972001
# Model after removing the 173rd observation
summary(model_minus173)$r.squared
[1] 0.569231
summary(model_minus173)$adj.r.squared
[1] 0.5590953
As expected, the model fit does seem to improve if we remove the outlier, as the outlier corresponds to a high prediction error, and thereby a high unexplained variation.
However, as the 173rd observations is not influential, the regression function should not change substantially. Let us compute the average percentage change in fitted values if we fit the model without the 173rd observation.
# fitted values using all observations
<- model$fitted.values
pred_all # fitted values using all but the 173rd observation
<- model_minus173$fitted.values
pred_all_minus173
mean(abs((pred_all_minus173-pred_all[-173])/pred_all[-173]*100))
[1] 6.999187
The average change in predictions is 7%, if the 173rd observation is removed.
9.3.2 DFFITS
Let us visualize the influence of an observation on the prediction for that observation.
plot(1:n, dffits(model), type="l")
text(1:n, dffits(model))
For the 173rd observation, \(DFFITS\) is more than one indicating a large influence on the prediction (or fitted value) for that observation.
# Prediction on the 173rd observation with model fitted on all data
predict(model, newdata = data[173,])
-Mike Schmidt
86.86152
# Prediction on the 173rd observation with model fitted on all data except the 173rd observation
predict(model_minus173, newdata = data[173,])
-Mike Schmidt
9.292043
The change in the predicted value is 89% if the 173rd observation is removed from the data. Although the 173rd observation doesn’t have a high influence overall, it does have a high influence on prediction at the point itself.
9.3.3 DFBETAS
Let us observe the influence of each observation on the regression coefficients.
Influence on the intercept:
# Intercept is the 1st predictor
plot(1:n, dfbetas(model)[,1], type="l")
text(1:n, dfbetas(model)[,1])
As DFBETAS for the 173rd observation is greater than one, the observation has a high influence on the intercept term
$coefficients model
(Intercept) AtBat Hits Walks CRBI DivisionW
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
PutOuts
0.2643076
$coefficients model_minus173
(Intercept) AtBat Hits Walks CRBI DivisionW
10.1309202 -1.6581831 7.3335069 3.8350690 0.6530299 -104.5410980
PutOuts
0.2617957
The intercept changes by 89% from $91 to $10 when the 173rd observation is removed.
Influence on the AtBat
:
# AtBat is the 2nd predictor
plot(1:n, dfbetas(model)[,2], type="l")
text(1:n, dfbetas(model)[,2])
None of the observations have a high influence on the coefficient of AtBat
.
Similary, we can visualize the influence of observations on all the regression coefficients.