Intro to Statistics and Data Science

6 Lab: Linear Model Selection and Regularization

This is a modified version of Lab 1: Subset Selection Methods, Lab 2: Ridge Regression and the Lasso, and Lab 3: PCR and PLS Regression from Chapter 6 of An Introduction to Statistical Learning with Applications in R. This version uses tidyverse techniques and methods that allow for scalability and a more efficient data analytic pipeline.

We will need the packages loaded below.

# Load packages
library(tidyverse)
library(modelr)
library(janitor)
library(skimr)
library(leaps) # best subset selection
library(glmnet) # ridge & lasso
library(glmnetUtils) # improves working with glmnet
library(pls) # pcr and pls

Since we will be making use of validation set and cross-validation techniques to build and select models, we should set our seed to ensure the analyses and procedures being performed are reproducible. Readers will be able to reproduce the lab results only if they run all code in sequence from start to finish; running code chunks out of order will produce different results. Setting the seed should occur towards the top of an analytic script, say directly after loading the necessary packages.

# Set seed 
set.seed(27182) # Used digits from e

6.1 Data Setup

We will be using the Hitters dataset from the ISLR library. Take a moment and inspect the codebook — ?ISLR::Hitters. Our goal will be to predict salary from the other available predictors.

hitter_dat <- read_csv("data/Hitters.csv") %>% 
  clean_names() %>% 
  rename(c_rbi = crbi) %>% 
  mutate_at(vars(league, division, new_league), factor)

Let’s skim the data to make sure it has been encoded correctly and quickly inspect it for other potential issues such as missing data.

hitter_dat %>% 
  skim_without_charts()
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             322       
## Number of columns          21        
## _______________________              
## Column type frequency:               
##   character                1         
##   factor                   3         
##   numeric                  17        
## ________________________             
## Group variables            None      
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 name                  0             1     7    17     0      322          0
## 
## ── Variable type: factor ───────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts    
## 1 league                0             1 FALSE          2 A: 175, N: 147
## 2 division              0             1 FALSE          2 W: 165, E: 157
## 3 new_league            0             1 FALSE          2 A: 176, N: 146
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##    skim_variable n_missing complete_rate    mean      sd    p0   p25    p50    p75  p100
##  1 at_bat                0         1      381.    153.    16   255.   380.   512     687
##  2 hits                  0         1      101.     46.5    1    64     96    137     238
##  3 hm_run                0         1       10.8     8.71   0     4      8     16      40
##  4 runs                  0         1       50.9    26.0    0    30.2   48     69     130
##  5 rbi                   0         1       48.0    26.2    0    28     44     64.8   121
##  6 walks                 0         1       38.7    21.6    0    22     35     53     105
##  7 years                 0         1        7.44    4.93   1     4      6     11      24
##  8 c_at_bat              0         1     2649.   2324.    19   817.  1928   3924.  14053
##  9 c_hits                0         1      718.    654.     4   209    508   1059.   4256
## 10 c_hm_run              0         1       69.5    86.3    0    14     37.5   90     548
## 11 c_runs                0         1      359.    334.     1   100.   247    526.   2165
## 12 c_rbi                 0         1      330.    333.     0    88.8  220.   426.   1659
## 13 c_walks               0         1      260.    267.     0    67.2  170.   339.   1566
## 14 put_outs              0         1      289.    281.     0   109.   212    325    1378
## 15 assists               0         1      107.    137.     0     7     39.5  166     492
## 16 errors                0         1        8.04    6.37   0     3      6     11      32
## 17 salary               59         0.817  536.    451.    67.5 190    425    750    2460

We see that there are missing values, all of which are contained in the response variable salary. There isn't much we can do here short of researching the missing salaries ourselves and filling them in from another source, so we will remove these observations from our dataset. Note that we could still produce predicted salaries for these players; we just wouldn't be able to assess how good those predictions are.

hitter_dat <- hitter_dat %>% na.omit()

We will be implementing model selection later in this lab, and we have a choice that impacts how we need to handle the dataset. We could opt to use \(C_p\), \(BIC\), or \(R^2_{adj}\), or we could use validation set or cross-validation approaches. Remember that \(C_p\), \(BIC\), and \(R^2_{adj}\) indirectly estimate test \(MSE\), while validation set and cross-validation approaches estimate it directly. The indirect approaches require assumptions that may not be tenable, and they use the entire dataset. We prefer the direct approaches to estimating test \(MSE\) because they are more flexible and computational limitations are no longer a barrier. Below we hold out a portion of the data for comparing our final candidate models and use the remainder for building them.

# Test set for comparing candidate models -- selecting final model
hitter_mod_comp_dat <- hitter_dat %>% sample_frac(0.15)

# Train set for candidate model building
hitter_mod_bldg_dat <- hitter_dat %>% setdiff(hitter_mod_comp_dat)
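
As a quick optional check (not part of the original pipeline), we can confirm the sizes of the two sets. There are 263 complete cases in total, so roughly 15% (39 players) are held out for model comparison and the remaining 224 are used for model building.

# Optional: confirm the split sizes
nrow(hitter_mod_comp_dat) # 39 players held out for comparing candidate models
nrow(hitter_mod_bldg_dat) # 224 players used for model building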

6.2 Subset Selection Methods

## Helper functions

# Get predicted values using regsubset object
predict_regsubset <- function(object, fmla, new_data, model_id) {
  # Not a dataframe? -- handle resample objects/k-folds
  if(!is.data.frame(new_data)){
    new_data <- as_tibble(new_data)
  }
  
  # Get formula
  obj_formula <- as.formula(fmla)
  
  # Extract coefficients for desired model
  coef_vector <- coef(object, model_id)
  
  # Get appropriate feature matrix for new_data
  x_vars <- names(coef_vector)
  mod_mat_new <- model.matrix(obj_formula, new_data)[ , x_vars]
  
  # Get predicted values
  pred <- as.numeric(mod_mat_new %*% coef_vector)
  
  return(pred)
}

# Calculate test MSE on regsubset objects
test_mse_regsubset <- function(object, fmla, test_data) {
  
  # Number of models
  num_models <- object %>% summary() %>% pluck("which") %>% dim() %>% .[1]
  
  # Set up storage
  test_mse <- rep(NA, num_models)
  
  # observed targets
  obs_target <- test_data %>% 
    as_tibble() %>% 
    pull(!!as.formula(fmla)[[2]])
  
  # Calculate test MSE for each model class
  for(i in 1:num_models){
    pred <- predict_regsubset(object, fmla, test_data, model_id = i)
    test_mse[i] <- mean((obs_target - pred)^2)
  }
  
  # Return the test errors for each model class
  tibble(model_index = 1:num_models,
         test_mse    = test_mse)
}
# Best subset: use 10-fold CV to select optimal number of variables
hitter_bestsubset_cv <- hitter_mod_bldg_dat %>% 
  crossv_kfold(10, id = "folds") %>% 
  mutate(
    fmla = "salary ~ . -name",
    model_fits = map2(fmla, train, 
                      ~ regsubsets(as.formula(.x), data = .y, nvmax = 19)),
    model_fold_mse = pmap(list(model_fits, fmla, test), test_mse_regsubset)
    )

# Forward selection: use 10-fold CV to select optimal number of variables
hitter_fwd_cv <- hitter_mod_bldg_dat %>% 
  crossv_kfold(10, id = "folds") %>% 
  mutate(
    fmla = "salary ~ . -name",
    model_fits = map2(fmla, train, 
                      ~ regsubsets(as.formula(.x), data = .y, nvmax = 19, method = "forward")),
    model_fold_mse = pmap(list(model_fits, fmla, test), test_mse_regsubset)
    )

# Backward selection: use 10-fold CV to select optimal number of variables
hitter_back_cv <- hitter_mod_bldg_dat %>% 
  crossv_kfold(10, id = "folds") %>% 
  mutate(
    fmla = "salary ~ . -name",
    model_fits = map2(fmla, 
                      train, 
                      ~ regsubsets(as.formula(.x), data = .y, nvmax = 19, method = "backward")),
    model_fold_mse = pmap(list(model_fits, fmla, test), test_mse_regsubset)
    )
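
Before summarizing the fold-level results, here is a brief illustration of how the helper functions are used on a single fit. The objects below (example_fit, example_pred, example_mse) are hypothetical names for demonstration only and are not used elsewhere in the lab.

# Illustration only: apply the helpers to one regsubsets fit
example_fit <- regsubsets(salary ~ . -name, data = hitter_mod_bldg_dat, nvmax = 19)

# Predicted salaries from the 5-predictor model
example_pred <- predict_regsubset(example_fit, "salary ~ . -name",
                                  new_data = hitter_mod_comp_dat, model_id = 5)

# Estimated test MSE for every model size (1 through 19)
example_mse <- test_mse_regsubset(example_fit, "salary ~ . -name", hitter_mod_comp_dat)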

Results for best subset search.

# Plot test MSE (exhaustive search)
hitter_bestsubset_cv %>% 
  unnest(model_fold_mse) %>% 
  group_by(model_index) %>% 
  summarise(test_mse = mean(test_mse)) %>% 
  ggplot(aes(model_index, test_mse)) + 
    geom_line()

# Plot test MSE (exhaustive search)
hitter_bestsubset_cv %>% 
  unnest(model_fold_mse) %>% 
  group_by(model_index) %>%  
  summarise(test_mse = mean(test_mse)) %>% 
  arrange(test_mse)
## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1          15  117122.
##  2          14  117193.
##  3           8  117286.
##  4          12  117538.
##  5          17  117947.
##  6          18  118060.
##  7          16  118062.
##  8          19  118214.
##  9           5  119508.
## 10          13  119908.
## 11          10  121000.
## 12          11  121711.
## 13           6  122946.
## 14           7  124173.
## 15           9  126421.
## 16           2  132444.
## 17           4  133917.
## 18           3  145646.
## 19           1  146460.

Results for the forward selection search.

# Plot test MSE (forward search)
hitter_fwd_cv %>% 
  unnest(model_fold_mse) %>% 
  group_by(model_index) %>% 
  summarise(test_mse = mean(test_mse)) %>% 
  ggplot(aes(model_index, test_mse)) + 
    geom_line()

# Plot test MSE (forward search)
hitter_fwd_cv %>% 
  unnest(model_fold_mse) %>% 
  group_by(model_index) %>%  
  summarise(test_mse = mean(test_mse)) %>% 
  arrange(test_mse) 
## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1           5  121038.
##  2           6  121203.
##  3          12  121949.
##  4          13  122314.
##  5          14  122320.
##  6           7  122411.
##  7          11  122449.
##  8           8  122463.
##  9          17  122807.
## 10          15  123045.
## 11          18  123146.
## 12          16  123221.
## 13          19  123566.
## 14           9  124648.
## 15          10  126415.
## 16           4  131604.
## 17           2  137200.
## 18           3  142593.
## 19           1  150856.

Results for the backward selection search.

# Plot test MSE (backward search)
hitter_back_cv %>% 
  unnest(model_fold_mse) %>% 
  group_by(model_index) %>% 
  summarise(test_mse = mean(test_mse)) %>% 
  ggplot(aes(model_index, test_mse)) + 
    geom_line()

# Table test MSE (backward search)
hitter_back_cv %>% 
  unnest(model_fold_mse) %>% 
  group_by(model_index) %>%  
  summarise(test_mse = mean(test_mse)) %>% 
  arrange(test_mse) 
## # A tibble: 19 x 2
##    model_index test_mse
##          <int>    <dbl>
##  1          11  122819.
##  2          17  124451.
##  3          12  124701.
##  4          16  124772.
##  5           8  125083.
##  6          10  125138.
##  7          18  125681.
##  8          15  125887.
##  9          19  125931.
## 10          14  126270.
## 11          13  127210.
## 12           7  128654.
## 13           9  130159.
## 14           6  135905.
## 15           5  141200.
## 16           4  155383.
## 17           1  158537.
## 18           2  159953.
## 19           3  170977.

The best subset (exhaustive) method suggests using a model with 8 predictors, the forward selection method suggests a model with 5 or 6 predictors, and the backward selection method suggests a model with 8 predictors; in each case we favor a smaller model whose cross-validated test MSE is at or near the minimum. We now fit each selection method on the entirety of hitter_mod_bldg_dat and extract the best model with the indicated number of predictors. These will be four of our candidate models (forward selection contributes two, one with 5 and one with 6 predictors).

hitter_regsubsets <- tibble(
  train = hitter_mod_bldg_dat %>% list(),
  test  = hitter_mod_comp_dat %>% list()
  ) %>%
  mutate(
    best_subset = map(train, ~ regsubsets(salary ~ . -name, 
                                          data = .x , nvmax = 19)),
    fwd_selection_5 = map(train, ~ regsubsets(salary ~ . -name, 
                                              data = .x, nvmax = 19, 
                                              method = "forward")),
    fwd_selection_6 = map(train, ~ regsubsets(salary ~ . -name, 
                                              data = .x, nvmax = 19, 
                                              method = "forward")),         
    back_selection = map(train, ~ regsubsets(salary ~ . -name, 
                                             data = .x, nvmax = 19, 
                                             method = "backward"))
    ) %>% 
  pivot_longer(cols = c(-test, -train), names_to = "method", values_to = "fit")

You’ll note that the backward and best subset methods both settled on models with 8 predictors. It is possible that they end up producing the same model (picking the same 8 predictors). We are also generally interested in which predictors each model is using, regardless of the method used. Let’s take a look at the estimated coefficients for these four models.

# Inspect/compare model coefficients 
hitter_regsubsets %>% 
  # extract fits
  pluck("fit") %>% 
  # first input is fits; second is model info to extract
  map2(c(8, 5, 6, 8), ~ coef(.x, id = .y)) %>% 
  map2(c("best", "fwd_05", "fwd_06", "back"), ~ enframe(.x, value = .y)) %>% 
  reduce(full_join) %>% 
  knitr::kable(digits = 3)
## Joining, by = "name"
## Joining, by = "name"
## Joining, by = "name"
name best fwd_05 fwd_06 back
(Intercept) 192.149 160.808 156.443 192.149
at_bat -2.549 -1.934 -2.271 -2.549
hits 8.162 8.326 8.585 8.162
walks 6.217 NA 3.191 6.217
c_runs 0.750 NA NA 0.750
c_rbi 0.565 0.702 0.669 0.565
c_walks -0.929 NA NA -0.929
divisionW -140.506 -139.429 -138.490 -140.506
put_outs 0.297 0.315 0.295 0.297

The table above indicates that the best subset and backward methods selected the same model. We also see that the models selected by forward selection use predictors that are all present in that larger model. That is, the forward selection models are just constrained versions of the larger model selected by the other two methods.
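
If we want to confirm this programmatically, an optional check like the one below works; it assumes the row order of hitter_regsubsets created above (best subset is the first fit, backward selection the fourth).

# Optional check: the two 8-predictor models should have matching coefficients
all.equal(
  coef(hitter_regsubsets$fit[[1]], id = 8), # best subset
  coef(hitter_regsubsets$fit[[4]], id = 8)  # backward selection
)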

6.3 Ridge Regression & the Lasso

The glmnet package allows users to fit elastic net regressions, which include ridge regression (alpha = 0) and the lasso (alpha = 1). Unfortunately the package does not use our familiar formula syntax, y ~ x. It requires the user to encode their data so that the penalized least squares algorithm can be applied. Both lm() and glm() also require correctly encoded data for the least squares algorithm to work, but they handle this automatically by calling model.matrix(), which takes the supplied formula and re-encodes the data appropriately. glmnet requires the user to call model.matrix() themselves. Wouldn't it be nice if the glmnet functions did this automatically, just like lm() and glm()? Fortunately the glmnetUtils package allows us to keep using our familiar formula interface with glmnet functions.
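
For illustration only, here is roughly what a glmnet call looks like without glmnetUtils. The objects x_mat, y_vec, and ridge_manual are hypothetical and not used later in the lab; the intercept column is dropped because glmnet adds its own intercept.

# Without glmnetUtils we would build the design matrix ourselves
x_mat <- model.matrix(salary ~ . -name, data = hitter_mod_bldg_dat)[, -1]
y_vec <- hitter_mod_bldg_dat %>% pull(salary)

# Ridge regression (alpha = 0) on the manually encoded data
ridge_manual <- glmnet::glmnet(x = x_mat, y = y_vec, alpha = 0)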

The tuning parameter for both ridge regression and the lasso is \(\lambda\), the size of the penalty applied in the least squares algorithm. How do we go about tuning, that is finding an estimate for, the \(\lambda\) parameter? We call upon cross-validation techniques once again. glmnet (via glmnetUtils) has a built-in function that implements cross-validation quickly and efficiently for us, meaning there is no need for us to use crossv_kfold(). We will be using cv.glmnet(); use ?glmnetUtils::cv.glmnet to access the correct documentation. It is the alpha argument to these functions that determines whether we are using ridge regression (alpha = 0) or the lasso (alpha = 1).

Since we are in search of an adequate \(\lambda\), we need to specify a range of values to search, much like we specified the number of predictors to search over when using regsubsets() (we started with 1 and went up to 19). The difference here is that \(\lambda\) can be any positive real number (the size of the penalty to apply). Obviously we cannot try every possible value greater than 0, so we need to set up a grid search. We could go with an automatically selected grid for \(\lambda\), or we can build our own grid. We will demonstrate both.

# lambda grid to search -- use for ridge regression (200 values)
lambda_grid <- 10^seq(-2, 10, length = 200)

# ridge regression: 10-fold cv
ridge_cv <- hitter_mod_bldg_dat %>% 
  cv.glmnet(
    formula = salary ~ . - name, 
    data = ., 
    alpha = 0, 
    nfolds = 10,
    lambda = lambda_grid
    )

The following plot displays the 200 estimated test \(MSE\) values from 10-fold cross-validation. The left dotted line marks the \(\lambda\) that minimizes the estimated test \(MSE\). The right dotted line corresponds to a \(\lambda\) value that results in a more regularized model (more shrinkage of coefficients towards zero) and whose estimated test \(MSE\) is within one standard error of the minimum (the models are not all that different in estimated test \(MSE\)). Why would we desire a more regularized model? A more regularized model typically shifts more of the prediction burden onto fewer predictor variables. The top of the plot displays (periodically) the number of predictors in the models; in the case of ridge regression this number will essentially never change, since ridge regression essentially never forces a coefficient to be exactly zero.

# Check plot of cv error
plot(ridge_cv)

Strictly speaking, the best \(\lambda\) is the one which minimizes the estimated test error. We will extract that value to use for ridge regression. We will also extract the \(\lambda\) that is within one standard error of the minimum, which can help safeguard against over-fitting.

# ridge's best lambdas
ridge_lambda_min <- ridge_cv$lambda.min
ridge_lambda_1se <- ridge_cv$lambda.1se

Let’s get the best \(\lambda\) values for the lasso. Recall that alpha = 1 for the lasso; this time we will let the function select the search grid for \(\lambda\).

# lasso: 10-fold cv
lasso_cv <- hitter_mod_bldg_dat %>% 
  cv.glmnet(
    formula = salary ~ . - name, 
    data = ., 
    alpha = 1, 
    nfolds = 10
    )

Note the difference between this plot and the one above: the number of predictors (top of graph) decreases as \(\lambda\) increases. This is an important distinction between the lasso and ridge regression. The lasso forces coefficients to be exactly zero (i.e., it performs variable selection), so for the lasso a more regularized model is also a more parsimonious (simpler) model. For this reason it is fairly standard practice to use lambda.1se as the best \(\lambda\). This is particularly useful when theory indicates that only a handful of variables are truly related to the response and the others would mostly fit irreducible error (i.e., they won't be useful for out-of-sample prediction).

# Check plot of cv error
plot(lasso_cv)

# lasso's best lambdas
lasso_lambda_1se <- lasso_cv$lambda.1se
lasso_lambda_min <- lasso_cv$lambda.min
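
As an optional quick look at the sparsity induced at lambda.1se, we can pull the coefficients straight from the cross-validation object (this assumes the standard coef() method for cv.glmnet objects); coefficients printed as dots have been set exactly to zero.

# Optional: which coefficients does the lasso zero out at lambda.1se?
coef(lasso_cv, s = "lambda.1se")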

Now that we have identified the best \(\lambda\)’s for each method we need to fit them to the entire model building dataset hitter_mod_bldg_dat. These will be our candidate models produced by ridge regression and the lasso. We do this within a tibble framework so that later we are able to calculate each model’s test error (test mse) on hitter_mod_comp_dat. This will allow us to easily compare all of our candidate models against each other in order to pick a final prediction model.

hitter_glmnet <- tibble(
  train = hitter_mod_bldg_dat %>% list(),
  test  = hitter_mod_comp_dat %>% list()
  ) %>%
  mutate(
    ridge_min = map(train, ~ glmnet(salary ~ . -name, data = .x,
                                    alpha = 0, lambda = ridge_lambda_min)),
    ridge_1se = map(train, ~ glmnet(salary ~ . -name, data = .x,
                                    alpha = 0, lambda = ridge_lambda_1se)),
    lasso_min = map(train, ~ glmnet(salary ~ . -name, data = .x,
                                    alpha = 1, lambda = lasso_lambda_min)),
    lasso_1se = map(train, ~ glmnet(salary ~ . -name, data = .x,
                                    alpha = 1, lambda = lasso_lambda_1se))
    ) %>% 
  pivot_longer(cols = c(-test, -train), names_to = "method", values_to = "fit")

Let’s take a look at the coefficients for the candidate models produced by ridge regression and the lasso.

# Inspect/compare model coefficients 
hitter_glmnet %>% 
  pluck("fit") %>% 
  map( ~ coef(.x) %>% 
         as.matrix() %>% 
         as.data.frame() %>% 
         rownames_to_column("name")) %>%
  reduce(full_join, by = "name") %>% 
  mutate_if(is.double, ~ if_else(. == 0, NA_real_, .)) %>% 
  rename(ridge_min = s0.x,
         ridge_1se = s0.y,
         lasso_min = s0.x.x,
         lasso_1se = s0.y.y) %>% 
  knitr::kable(digits = 3)
name ridge_min ridge_1se lasso_min lasso_1se
(Intercept) 128.869 243.584 111.713 175.101
at_bat -1.373 0.074 -2.111 NA
hits 4.554 0.318 7.066 1.352
hm_run -5.922 0.788 -2.658 NA
runs 0.517 0.476 NA NA
rbi 1.621 0.505 0.265 NA
walks 3.929 0.586 4.880 0.382
years -14.477 2.362 -14.301 NA
c_at_bat -0.007 0.007 NA NA
c_hits 0.181 0.026 NA NA
c_hm_run 0.616 0.189 NA NA
c_runs 0.330 0.050 0.603 0.009
c_rbi 0.426 0.055 0.739 0.471
c_walks -0.474 0.054 -0.662 NA
leagueA -37.974 -2.640 -45.825 NA
leagueN 38.034 2.640 0.000 NA
divisionE 72.204 18.242 138.364 4.839
divisionW -72.092 -18.242 NA NA
put_outs 0.302 0.046 0.301 0.087
assists 0.231 0.005 0.235 NA
errors -4.729 -0.081 -3.159 NA
new_leagueA 22.324 -2.639 12.415 NA
new_leagueN -22.244 2.639 0.000 NA

6.4 PCR & PLS Regression

We will be using pcr() from the pls library to implement principal component regression. Recall that the purpose of PCR is to reduce the dimension of our predictor set. That is, we want to find a set of \(m\) principal components (linear combinations of the predictors) that we then use in a linear regression. This helps with the problem of having too many predictors relative to observations, especially when the number of predictors exceeds the number of observations (\(p > n\)).
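
To make the idea concrete, the sketch below shows roughly what PCR amounts to: principal components are computed from the encoded predictors and the response is regressed on the first \(m\) of them. This is an illustration only; the objects x_design, pca_fit, and pcr_by_hand are hypothetical, the choice of 3 components is arbitrary, and pcr() below handles all of this (plus cross-validation) for us. The scale. = TRUE standardization choice is discussed in a moment.

# Sketch: PCA on the encoded predictors, then least squares on the first 3 scores
x_design <- model.matrix(salary ~ . -name, data = hitter_mod_bldg_dat)[, -1]
pca_fit <- prcomp(x_design, scale. = TRUE)
pcr_by_hand <- lm(hitter_mod_bldg_dat$salary ~ pca_fit$x[, 1:3])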

Hopefully it is clear that \(m\), the number of principal components, is just a tuning parameter, meaning we can once again turn to cross-validation to pick the optimal \(m\) (optimal in that we want the lowest possible estimated test error). We can set validation = "CV" in pcr() to conduct 10-fold cross-validation by default. Note that you can control the number of folds by specifying, for example, segments = 5 for 5-fold cross-validation (an example call is sketched after the code below).

Before we proceed to tuning the \(m\) parameter there is one more important decision to be made: do we want to standardize each predictor variable prior to forming the principal components? This is a very important decision because forming principal components is sensitive to scale, and standardizing each variable forces them all onto the same scale. If we choose to standardize the predictors, then we are saying they are all equally important and none should be prioritized over another because of how it is measured. If not, we are indicating that the scale of the predictor variables is meaningful and should be taken into account when forming principal components. In most cases we will want scale = TRUE, because we will have many variables measured on many different scales with no reasonable justification for believing that the scale of measurement should matter. For example, a variable such as height does not become more useful when measured in centimeters rather than meters.

# pcr: 10-fold cv
pcr_cv <- hitter_mod_bldg_dat %>% 
  pcr(salary ~ . -name, data = ., scale = TRUE, validation = "CV")
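
If, for example, we wanted 5-fold cross-validation instead, the call would look like the sketch below (pcr_cv_5fold is an illustrative object and is not used in what follows).

# Example only: 5-fold CV via segments = 5
pcr_cv_5fold <- hitter_mod_bldg_dat %>% 
  pcr(salary ~ . -name, data = ., scale = TRUE, validation = "CV", segments = 5)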

We can use validationplot() to find the \(m\) (number of components) that corresponds to the lowest estimated test error (reported either as root mean squared error or as mean squared error). The plots below indicate that 7 would be a good choice for the number of principal components; a vertical line is added at 7 to help identify it.

# Using root mean squared error
validationplot(pcr_cv)
abline(v = 7) # add vertical line at 7

# Using mean squared error
validationplot(pcr_cv, val.type="MSEP")
abline(v = 7) # add vertical line at 7

If it becomes too difficult to determine the optimal value using a validationplot() then use summary() to get a table of the values.

pcr_cv %>% 
  summary()
## Data:    X dimension: 224 19 
##  Y dimension: 224 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
## CV           450.9    358.7    358.7    357.3    353.6    351.5    350.1    347.9    349.9    349.9
## adjCV        450.9    358.1    358.1    356.7    352.9    350.9    348.9    346.8    348.7    348.7
##        10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps
## CV        354.2     356.2     356.8     367.7     362.7     353.9     340.3     341.9     339.5
## adjCV     352.5     354.6     355.0     365.4     359.4     351.5     338.0     339.4     337.0
##        19 comps
## CV        344.4
## adjCV     341.6
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
## X         38.69    59.96    71.03    79.41    84.48    88.82    92.52    95.18    96.39     97.37
## salary    39.53    39.92    40.63    42.33    43.93    46.04    46.99    47.14    47.30     48.42
##         11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## X          98.03     98.66     99.17     99.47     99.75     99.90     99.97     99.99     100.0
## salary     48.54     48.71     49.16     52.12     52.80     55.97     56.39     56.89      56.9

Let’s move on to calculating the optimal \(m\) when using plsr(), partial least squares regression. Note that \(m\) stands for the number of partial least squares directions (much like the number of principal components).

# pls: 10-fold cv
pls_cv <- hitter_mod_bldg_dat %>% 
  plsr(salary ~ . -name, data = ., scale = TRUE, validation = "CV")

It looks like 4 partial least squares directions produce a cross-validation error close to the lowest while significantly reducing the dimensionality of our predictor set. We could also make a case for using 8 partial least squares directions.

validationplot(pls_cv, val.type = "MSEP")
abline(v = 4)
abline(v = 8)

There doesn’t appear to be much difference between using 2, 3, or 4 partial least squares directions either (see the plot above and the table below). The fewer the directions the better, because the purpose of partial least squares regression is to reduce the dimension of our predictor space. Let’s also plan on proposing a candidate model with 2 partial least squares directions.

pls_cv %>% 
  summary()
## Data:    X dimension: 224 19 
##  Y dimension: 224 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
## CV           450.9    349.9    346.2    345.4    345.0    350.5    349.0    345.1    339.5    339.4
## adjCV        450.9    349.5    345.2    344.4    343.8    348.4    346.4    342.8    337.3    337.2
##        10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps
## CV        339.3       336     337.9     340.3     341.4     341.0     341.0     340.8     340.9
## adjCV     337.1       334     335.7     337.9     339.0     338.5     338.5     338.3     338.4
##        19 comps
## CV        343.6
## adjCV     340.9
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps
## X         38.48    48.29    64.29    73.60    79.24    84.54    88.19    90.19    93.13     95.72
## salary    42.08    47.15    48.54    49.83    52.00    53.79    54.68    55.73    56.04     56.21
##         11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## X          96.94     97.73     98.36     98.67     99.02     99.23     99.66     99.99     100.0
## salary     56.45     56.58     56.66     56.78     56.85     56.88     56.89     56.90      56.9

Now that we have identified the best \(m\) for each method we need to fit them to the entire model building dataset hitter_mod_bldg_dat. These will be the candidate models produced by principal component regression and partial least squares regression. We do this within a tibble framework so that later we are able to calculate each model’s test error (test mse) on hitter_mod_comp_dat. This will allow us to easily compare all of our candidate models against each other in order to pick a final prediction model.

hitter_dim_reduct <- tibble(
  train = hitter_mod_bldg_dat %>% list(),
  test  = hitter_mod_comp_dat %>% list()
  ) %>%
  mutate(
    pcr_7m = map(train, ~ pcr(salary ~ . -name, data = .x, ncomp = 7)),
    pls_8m = map(train, ~ plsr(salary ~ . -name, data = .x, ncomp = 8)),
    pls_4m = map(train, ~ plsr(salary ~ . -name, data = .x, ncomp = 4)),
    pls_2m = map(train, ~ plsr(salary ~ . -name, data = .x, ncomp = 2))
    ) %>% 
  pivot_longer(cols = c(-test, -train), names_to = "method", values_to = "fit")

6.5 Final Model Selection

We now need to calculate the test error on hitter_mod_comp_dat for each candidate model produced by the methods fit above.

# Test error for regsubset fits
regsubset_error <- hitter_regsubsets %>% 
  mutate(test_mse = map2(fit, test, ~ test_mse_regsubset(.x, salary ~ . -name, .y))) %>% 
  unnest(test_mse) %>% 
  group_by(method) %>% 
  filter(model_index == max(model_index)) %>% 
  select(method, test_mse) %>% 
  ungroup()

# Test error for ridge and lasso fits
glmnet_error <- hitter_glmnet %>% 
  mutate(pred = map2(fit, test, predict),
         test_mse = map2_dbl(test, pred, ~ mean((.x$salary - .y)^2))) %>% 
  unnest(test_mse) %>% 
  select(method, test_mse)

# Test error for pcr and plsr fits
dim_reduce_error <- hitter_dim_reduct %>% 
  mutate(pred = pmap(list(fit, test, c(7, 8, 4, 2)), predict),
         test_mse = map2_dbl(test, pred, ~ mean((.x$salary - .y)^2))) %>% 
  unnest(test_mse)  %>% 
  select(method, test_mse)

# Test errors combined and organized
regsubset_error %>% 
  bind_rows(glmnet_error) %>% 
  bind_rows(dim_reduce_error) %>% 
  arrange(test_mse) %>%
  knitr::kable(digits = 2)
method test_mse
pls_4m 112607.6
pcr_7m 114341.1
ridge_min 131184.2
lasso_min 133106.5
lasso_1se 134645.3
pls_8m 135032.4
fwd_selection_5 136732.3
fwd_selection_6 136732.3
back_selection 136732.3
best_subset 136732.3
ridge_1se 140772.4
pls_2m 143780.4

It appears that the model built using PLS with 4 partial least squares directions is the best (lowest test MSE), though PCR with 7 components isn't far behind. If we are hoping for a model that is easier to interpret, then the model with 8 predictors selected by both backward and best subset selection would be a reasonable choice.
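
If we go with the PLS model, a natural last step, sketched below and not part of the original lab, would be to refit it on all complete cases (hitter_dat) before using it for future predictions. The call mirrors the candidate-model fit above; final_pls is a hypothetical object name.

# Sketch: refit the chosen 4-direction PLS model on all complete cases
final_pls <- hitter_dat %>% 
  plsr(salary ~ . -name, data = ., ncomp = 4)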