5 Lab: Cross-Validation and the Bootstrap
This is a modified version of the "Lab: Cross-Validation and the Bootstrap" section of Chapter 5 from An Introduction to Statistical Learning with Applications in R. This version uses tidyverse techniques and methods that allow for scalability and a more efficient data analytic pipeline.
We will need the packages loaded below.
# Load packages
library(tidyverse)
library(janitor)
library(skimr)
library(modelr)
Whenever performing analyses or procedures that involve randomization or resampling, it is considered best practice to set the seed of your software's random number generator. In R this is done with set.seed(). Setting the seed ensures the analyses and procedures being performed are reproducible. For instance, readers following along in this lab will be able to produce precisely the same results as those shown, provided they run all code in the lab in sequence from start to finish; running code chunks out of order will break the correspondence. Setting the seed should occur towards the top of an analytic script, say directly after loading the necessary packages.
# Set seed
set.seed(27182) # Used digits from e
5.1 Validation Set Approach
We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models to the Auto dataset. This dataset is from the ISLR library; take a moment to inspect the codebook with ?ISLR::Auto. We will read in the data from the Auto.csv file and do a little processing.
auto_dat <- read_csv("data/Auto.csv") %>%
clean_names()
We begin by using the sample_frac() function to split the set of observations into two halves by selecting a random subset of 196 observations out of the original 392 observations. We refer to these observations as the training set; the remaining 196 observations form the test (validation) set.
# Split the data in half: train gets a random 50%, test gets the rest
auto_validation <- tibble(train = auto_dat %>% sample_frac(0.5) %>% list(),
                          test = auto_dat %>% setdiff(train[[1]]) %>% list())
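As a quick sanity check, we can confirm the two halves have the expected sizes (196 rows each, assuming the data contain no duplicate rows for setdiff() to collapse):
# Verify the split: each half should contain 196 of the 392 observations
auto_validation %>%
  summarise(n_train = nrow(train[[1]]),
            n_test = nrow(test[[1]]))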
Let’s keep it relatively simple: we fit a simple linear regression using horsepower to predict mpg, along with polynomial regressions of horsepower up to degree 10.
# Setup tibble with model names and formulas
model_def <- tibble(degree = 1:10,
fmla = str_c("mpg ~ poly(horsepower, ", degree, ")"))
# Combine validation setup with model fitting info
auto_validation <- auto_validation %>%
crossing(model_def)
# Add model fits and assessment
auto_validation <- auto_validation %>%
mutate(model_fit = map2(fmla, train, lm),
test_mse = map2_dbl(model_fit, test, mse))
auto_validation %>%
select(degree, test_mse) %>%
arrange(test_mse)
## # A tibble: 10 x 2
## degree test_mse
## <int> <dbl>
## 1 2 18.6
## 2 4 18.9
## 3 3 18.9
## 4 7 19.0
## 5 6 19.1
## 6 8 19.3
## 7 5 19.3
## 8 9 21.4
## 9 1 22.1
## 10 10 22.6
auto_validation %>%
select(degree, test_mse) %>%
ggplot(aes(x = degree, y = test_mse)) +
geom_line()
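To demystify the mse() helper from modelr, here is a hand-rolled equivalent for the degree-2 fit. This is a minimal sketch; deg2_fit and test_half are throwaway names introduced here, not objects from the lab.
# Hand-rolled test MSE: mean squared prediction error on the held-out half
deg2_fit <- lm(mpg ~ poly(horsepower, 2), data = auto_validation$train[[1]])
test_half <- auto_validation$test[[1]]
mean((test_half$mpg - predict(deg2_fit, newdata = test_half))^2)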
5.2 Leave-One-Out Cross-Validation
We now estimate the test error using leave-one-out cross-validation (LOOCV): each model is fit on all but one observation and evaluated on the single held-out observation, repeating this for every observation. The crossv_loo() function from modelr constructs all 392 train/test splits for us.
auto_loocv <- auto_dat %>%
crossv_loo(id = "fold") %>%
mutate(
train = map(train, as_tibble),
test = map(test, as_tibble)
)
auto_loocv <- auto_loocv %>%
crossing(model_def) %>%
mutate(
model_fit = map2(fmla, train, lm),
fold_mse = map2_dbl(model_fit, test, mse)
)
auto_loocv %>%
group_by(degree) %>%
summarise(test_mse = mean(fold_mse)) %>%
arrange(test_mse)
## # A tibble: 10 x 2
## degree test_mse
## <int> <dbl>
## 1 7 18.8
## 2 8 19.0
## 3 6 19.0
## 4 5 19.0
## 5 9 19.1
## 6 2 19.2
## 7 3 19.3
## 8 4 19.4
## 9 10 19.5
## 10 1 24.2
auto_loocv %>%
group_by(degree) %>%
summarise(test_mse = mean(fold_mse)) %>%
ggplot(aes(x = degree, y = test_mse)) +
geom_line()
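It is worth knowing that for least squares fits LOOCV does not actually require refitting the model \(n\) times. The shortcut \(\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2\), where \(h_i\) is the leverage of observation \(i\) (equation 5.2 in ISLR), gives the same answer from a single fit. A sketch for the degree-1 model:
# Closed-form LOOCV for least squares (ISLR eq. 5.2): one fit, no loop
fit_deg1 <- lm(mpg ~ horsepower, data = auto_dat)
mean(((auto_dat$mpg - fitted(fit_deg1)) / (1 - hatvalues(fit_deg1)))^2)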
5.3 \(k\)-fold Cross-Validation
LOOCV requires fitting each model once per observation, which can get expensive. \(k\)-fold cross-validation instead randomly divides the observations into \(k\) folds, fitting on \(k - 1\) folds and evaluating on the remaining fold, cycling through all \(k\) folds. Here we use crossv_kfold() from modelr with \(k = 10\).
auto_10fold <- auto_dat %>%
crossv_kfold(10, id = "fold") %>%
mutate(
train = map(train, as_tibble),
test = map(test, as_tibble)
)
auto_10fold <- auto_10fold %>%
crossing(model_def) %>%
mutate(model_fit = map2(fmla, train, lm),
fold_mse = map2_dbl(model_fit, test, mse))
auto_10fold %>%
ggplot(aes(x = degree, y = fold_mse, color = fold)) +
geom_line()
auto_10fold %>%
group_by(degree) %>%
summarize(test_mse = mean(fold_mse)) %>%
ggplot(aes(x = degree, y = test_mse)) +
geom_line() +
geom_point()
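Unlike LOOCV, the \(k\)-fold estimate depends on the random assignment of observations to folds. Rerunning the split, reusing model_def from above, gives a slightly different curve; this sketch simply repeats the pipeline with a fresh fold assignment.
# Repeat 10-fold CV with a new random fold assignment
auto_dat %>%
  crossv_kfold(10, id = "fold") %>%
  mutate(train = map(train, as_tibble),
         test = map(test, as_tibble)) %>%
  crossing(model_def) %>%
  mutate(model_fit = map2(fmla, train, lm),
         fold_mse = map2_dbl(model_fit, test, mse)) %>%
  group_by(degree) %>%
  summarise(test_mse = mean(fold_mse)) %>%
  arrange(test_mse)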
5.4 The Bootstrap
We will be using the Portfolio dataset from the ISLR package (see ?ISLR::Portfolio for details), in which each row records the returns x and y on two financial assets. We will load the dataset from the Portfolio.csv file.
portfolio_dat <- read_csv("data/Portfolio.csv") %>% clean_names()
portfolio_dat %>%
skim_without_charts()
## ── Data Summary ────────────────────────
## Values
## Name Piped data
## Number of rows 100
## Number of columns 2
## _______________________
## Column type frequency:
## numeric 2
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
## 1 x 0 1 -0.0771 1.06 -2.43 -0.888 -0.269 0.558 2.46
## 2 y 0 1 -0.0969 1.14 -2.73 -0.886 -0.229 0.807 2.57
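A quick scatterplot of the two asset returns helps put the statistic we are about to bootstrap in context:
# Scatterplot of the returns on the two assets
portfolio_dat %>%
  ggplot(aes(x = x, y = y)) +
  geom_point()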
5.4.1 Estimating the Accuracy of a Statistic of Interest
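The statistic of interest is the minimum-variance portfolio allocation from Section 5.2 of ISLR: if we invest a fraction \(\alpha\) of our money in asset \(X\) and the remaining \(1 - \alpha\) in asset \(Y\), the variance of the portfolio return is minimized by
\[\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}\]
which alpha_fn() below computes from the sample variances and covariance.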
# Statistic of interest
alpha_fn <- function(resample_obj){
resample_obj %>%
# turn resample object into dataset
as_tibble() %>%
summarise(alpha = (var(y)-cov(x,y))/(var(x)+var(y)-2*cov(x,y))) %>%
pull(alpha)
}
# Estimate on original data
original_est <- portfolio_dat %>%
alpha_fn()
original_est
## [1] 0.5758321
# create 1000 bootstrap estimates
portfolio_boot <- portfolio_dat %>%
bootstrap(1000, id = "boot_id") %>%
mutate(alpha_boot = map_dbl(strap, alpha_fn))
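Note that bootstrap() stores each resample as a lightweight resample object (row indices into the original data) rather than a full copy of the dataset, which is why alpha_fn() begins by converting with as_tibble().
# Each strap entry is a resample object, not a data frame
class(portfolio_boot$strap[[1]])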
# Summary of bootstrap estimates
portfolio_boot %>%
select(alpha_boot) %>%
skim_without_charts()
## ── Data Summary ────────────────────────
## Values
## Name Piped data
## Number of rows 1000
## Number of columns 1
## _______________________
## Column type frequency:
## numeric 1
## ________________________
## Group variables None
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
## 1 alpha_boot 0 1 0.577 0.0881 0.317 0.513 0.573 0.633 0.860
# Boxplot of bootstrap estimates with estimate from original data (red X)
portfolio_boot %>%
ggplot(aes(x = "alpha_boot" , y = alpha_boot)) +
geom_boxplot() +
geom_point(aes(x = "alpha_boot", y = original_est),
shape = 4, size = 4, color = "red") +
coord_flip()
# Histogram of bootstrap estimates with estimate from original data (red dashed line)
portfolio_boot %>%
ggplot(aes(x = alpha_boot)) +
geom_histogram(bins = 25, color = "white") +
geom_vline(aes(xintercept = original_est),
color = "red", linetype = "dashed")
# Bootstrap estimate of standard error for estimator
portfolio_boot %>%
summarise(est_boot = original_est,
est_se = sd(alpha_boot))
## # A tibble: 1 x 2
## est_boot est_se
## <dbl> <dbl>
## 1 0.576 0.0881
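The bootstrap distribution also supports interval estimates. For instance, a 95% percentile bootstrap confidence interval simply takes the empirical quantiles of the bootstrap estimates:
# 95% percentile bootstrap confidence interval for alpha
portfolio_boot %>%
  summarise(ci_lower = quantile(alpha_boot, 0.025),
            ci_upper = quantile(alpha_boot, 0.975))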
5.4.2 Estimating the Accuracy of a Linear Regression Model
# 1000 Bootstraps, fit models, get parameter estimates
auto_boot <- auto_dat %>%
modelr::bootstrap(1000, id = "boot_id") %>%
mutate(model_fit = map(strap, lm, formula = mpg ~ horsepower),
mod_tidy = map(model_fit, broom::tidy))
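Each mod_tidy entry is a two-row tibble of coefficient estimates, so unnesting below stacks the 1000 fits into 2000 rows (two terms per fit). A quick peek at the first fit:
# Tidy coefficient table from the first bootstrap fit
auto_boot$mod_tidy[[1]]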
# Examine bootstrap coefficient estimates
auto_boot %>%
unnest(mod_tidy) %>%
group_by(term) %>%
select(term, estimate) %>%
skim_without_charts()
## ── Data Summary ────────────────────────
## Values
## Name Piped data
## Number of rows 2000
## Number of columns 2
## _______________________
## Column type frequency:
## numeric 1
## ________________________
## Group variables term
##
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────
##   skim_variable term        n_missing complete_rate   mean      sd     p0    p25    p50    p75   p100
## 1 estimate      (Intercept)         0             1 40.0   0.866   37.3   39.4   39.9   40.6   43.1
## 2 estimate      horsepower          0             1 -0.158 0.00743 -0.184 -0.163 -0.158 -0.153 -0.136
# Histogram of bootstrap coefficient estimates
auto_boot %>%
unnest(mod_tidy) %>%
ggplot(aes(x = estimate)) +
geom_histogram(color = "white", bins = 25) +
facet_wrap(~ term, scales = "free_x")
# Boxplot of bootstrap coefficient estimates
auto_boot %>%
unnest(mod_tidy) %>%
ggplot(aes(x = "", y = estimate)) +
geom_boxplot() +
facet_wrap(~ term, scales = "free_y")
# Estimates using original data (including SE)
auto_dat %>%
lm(mpg ~ horsepower, data = .) %>%
broom::tidy() %>%
select(-statistic, -p.value)
## # A tibble: 2 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) 39.9 0.717
## 2 horsepower -0.158 0.00645
# Estimate SEs using the bootstrap
auto_boot %>%
unnest(mod_tidy) %>%
group_by(term) %>%
summarise(est_se = sd(estimate))
## # A tibble: 2 x 2
## term est_se
## <chr> <dbl>
## 1 (Intercept) 0.866
## 2 horsepower 0.00743
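The bootstrap standard errors (0.866 and 0.00743) are noticeably larger than the formula-based ones (0.717 and 0.00645). As discussed in ISLR, the standard formulas rely on the linear model's assumptions, and since the true relationship between mpg and horsepower is nonlinear those assumptions are violated here, so the bootstrap estimates are arguably the more trustworthy of the two. For a side-by-side comparison (formula_se is a helper name introduced for this sketch):
# Formula-based vs bootstrap standard errors, side by side
formula_se <- auto_dat %>%
  lm(mpg ~ horsepower, data = .) %>%
  broom::tidy() %>%
  select(term, formula_se = std.error)
auto_boot %>%
  unnest(mod_tidy) %>%
  group_by(term) %>%
  summarise(boot_se = sd(estimate)) %>%
  left_join(formula_se, by = "term")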