Intro to Statistics and Data Science

Chapter 13 Putting it all together

Over the course of this book, you have been introduced to methods for data cleaning, visualization, and analysis (Part I) and the underlying theory of estimation, generalizability, and causal inference (Parts II & III). In this chapter, we put all of these methods and theory together, illustrating how descriptive and inferential statistics can be used with real data.

Needed Packages

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages.

library(tidyverse)

13.1 A general process for using statistics

In general, there are two types of analyses that statistics are used for: Exploratory and Confirmatory.

Exploratory research typically doesn’t start with a clear question – it starts with data. For example, maybe you discover that it is possible to download data regarding all of the movies found in IMDB. In this case, just as in the beginning of this book, you explore the data visually and using summary statistics, maybe even comparing groups, variables, and relationships between them. Only after you’ve explored do you have a clear sense of some questions that you might want to ask, like: do action films gross more, on average, than comedies? Has this trend changed over time? In many cases, these questions can be answered just using descriptive statistics.

Confirmatory research starts with a clear question. This research typically builds on previous, exploratory research, and may involve the collection of data – e.g., via a survey or an experiment. In confirmatory research, after clearly defining the questions, you need to determine how data will be used to answer to these questions. For example, perhaps you want to know if allowing students to use laptops in class increases (or reduces) learning. To answer this, you randomly assign students to use laptops or not in class, and at the end of the semester you compare grades between the two groups. Determining if there is a difference here requires inferential statistics – e.g., p-values and hypothesis tests.

In either case, you need to pay attention to issues regarding causality and generalizability. When you use descriptive statistics, you are limiting your generalizations to the sample, not making any claims beyond the data you have in front of you. When you use inferential statistics, you are implicitly generalizing to other samples like the one that you have – i.e., gathered in the same way. But even then, this population needs to be clearly defined, indicating where the results generalize and where they do not. Similarly, if your question involves comparisons between groups or relationships between variables, you need to determine if you can (not just if you want to) infer causality from your data.

In the remainder of this chapter, we provide three examples illustrating confirmatory research questions. These complement the data analyses provided in the first part of the book that focus on exploratory research.

13.2 Example: Treatment effect

The Tennessee STAR experiment was conducted in the 1980s in order to determine if class size effects student learning. In each school in the experiment, Kindergarten students were randomized to either be in a small classroom (13-17 students) or a regular sized classroom (22-27 students). This design was replicated in about 80 schools throughout the state. For information on the study, see here.

For this example, we focus on data from a single school. In this school, there were 137 Kindergarten students that were randomly divided into 7 classes: 3 “small” classes and 4 “regular” classes. Teachers were also randomly assigned to these classrooms. At the end of the year, students were tested and achievement scores obtained. Let's load this data and take a look at it.

star_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTluVgdD7hdPLkKPMpN3OZjpBdZ1HloCGEQ1abcgfcWJHGYQppUGyXsGaPz74XQRwaMobDFLZgN3caU/pub?gid=0&single=true&output=csv")
glimpse(star_data)
Rows: 137
Columns: 6
$ school_id <dbl> 169229, 169229, 169229, 169229, 169229, 169229, 169229, 169…
$ class_id  <dbl> 16922901, 16922901, 16922901, 16922901, 16922901, 16922901,…
$ reading   <dbl> 461, 527, 500, 474, 500, 545, 451, 442, 472, 460, 470, 463,…
$ math      <dbl> 559, 547, 602, 478, 547, 602, 478, 513, 520, 520, 506, 559,…
$ class_si  <dbl> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,…
$ small     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…

This dataset includes 6 variables:

  • school_id - school ID, which is the same for every student in the dataset, since we're only looking at one school
  • class_id - indicates which of the 7 classes the student belongs to
  • reading - end-of-year reading achievement score
  • math - end-of-year math achievement score
  • class_si - number of students in the classroom
  • small - indicator variable for whether or not the class is "small" or "regular" (1 = small, 0 = regular)

Question: Did the type of class (small, regular) affect student math achievement, on average, in this school?

Model: To answer this, we can use a regression model, treating small as a dummy variable:

\[\widehat{math} = b_0 + b_1*small\]

Null Hypothesis: We can use a stochastic proof by contradiction in order to prove that there is a difference. To do so, we begin by assuming that there is no difference (our null) and then look to see if our data shows a difference large enough to contradict this.

\[H_0: \beta_1 = 0\] \[H_A: \beta_1 \neq 0\]

Notice here that the population parameter we are interested in is \(\beta_1\), which we estimate by the sample statistic \(b_1\) in our model. Because the treatment of having small class sizes is fairly costly (as it would require hiring and paying more teachers), we want to make sure we have pretty strong evidence that it is effective before recommending that it should be adopted as a policy. Therefore, we design our test to have a Type I error of \(\alpha = 0.01\). That is, we are only willing to tolerate a 1% probability of falsely finding the treatment to be effective when in fact it is not, so we will only reject the null hypothesis if \(p < 0.01.\)

Estimate: Our estimated model is found in the table below.

star_model <- lm(math ~ small, data = star_data)
summary(star_model)

Call:
lm(formula = math ~ small, data = star_data)

Residuals:
    Min      1Q  Median      3Q     Max 
 -99.99  -27.45   -2.99   28.01  134.01 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   491.99       4.68  105.23   <2e-16 ***
small          19.46       7.82    2.49    0.014 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43.9 on 135 degrees of freedom
Multiple R-squared:  0.0439,    Adjusted R-squared:  0.0368 
F-statistic:  6.2 on 1 and 135 DF,  p-value: 0.014

In this output, we can see that on average, students in small classes scored 19.46 points more than students in the regular-sized classes. We can use the values from the model output to compute the statistic \[test\_stat = \frac{Estimate - Null \ \ value}{SE(Estimate)} = \frac{b_1 - 0}{SE(b_1)} = \frac{19.46}{7.82} = 2.49,\] which will help us to determine the probability that we would see a value this large if in fact there was no effect of class size. The p-value of \(0.014\) reported in the regression output indicates that we would see an effect this large if there was actually no effect in about 1.4% of samples. Note this is the same p-value we would obtain by using the pt() function, our t value (i.e. \(test\_stat\)), and the appropriate degrees of freedom: pt(2.49, df = (137 - 2), lower.tail = FALSE)*2 = 0.014.

Conclusion: Since the p-value is 1.4%, which is greater than 1% (\(p > \alpha\)), we fail to reject our null hypothesis and therefore conclude that there is insufficient evidence that class-size reduction increased learning. Note that because this school was not randomly selected from a population, it is difficult to generalize the results beyond this school.

13.3 Example: Estimate a proportion

The General Social Survey is an annual probability survey of American adults. Randomly selected adults are contacted via telephone and asked questions regarding their attitudes and experiences. In 2002, one question asked,

“People differ in their ideas about what it takes for a young person to become an adult these days. How important is it for them to be no longer living in their parents' household?”

The GSS showed that 29% of those asked felt that it was “Extremely important” for adults to no longer live with their parents. You can see this data here.

In 2008, there was an economic downturn and increasing numbers of young adults returned to living with their parents. This led a researcher interested in understanding changing attitudes to ask: Are the attitudes among Beta University students in 2019 different than these trends in 2002? In order to answer this, the researcher conducted a survey of 100 students at the university. Let's take a look at this data.

Rows: 100
Columns: 1
$ important <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,…

The variable important has a value of 1 if the student answered "extremely important" and a value of 0 otherwise.

Question: Do Beta University students in 2019 think it is as important for a young person to not live with their parents as people did in 2002?

Null Hypothesis: We wish to know if the average percent of students in 2019 is different from that in 2002. To answer this, we conduct a stochastic proof by contradiction, in which we assume that the percent of students in 2019 would be 29% just as in 2002. Our null hypothesis is thus that the proportion is 0.29.

\[H_0: \pi = 0.29\] \[H_A: \pi \neq 0.29\]

Given the cost of collecting this data and associated trade-offs, we decided to set our Type I error at 10%.

Estimate: Let's use prop.test() to obtain our estimate. Recall that prop.test() requires an argument x that gives the counts of "successes," which in this case means the number of people who responded "extremely important" (i.e. important == 1 in the data), and an argument n for the total number of "trials," which in this case means the total number of people surveyed (i.e. n = 100).

beta_U_summary <- beta_U_data %>% 
  summarize(important = sum(important),
            n = n())
beta_U_summary
# A tibble: 1 x 2
  important     n
      <dbl> <int>
1        32   100

The default for prop.test() assumes the null hypothesis is \(\pi = 0.5\), so we need to override that here by specifying p = 0.29.

prop.test(x = beta_U_summary$important, n = beta_U_summary$n, p = 0.29)

    1-sample proportions test with continuity correction

data:  beta_U_summary$important out of beta_U_summary$n, null probability 0.29
X-squared = 0.3, df = 1, p-value = 0.6
alternative hypothesis: true p is not equal to 0.29
95 percent confidence interval:
 0.232 0.422
sample estimates:
   p 
0.32 

We see that the estimate from our sample is \(\hat{\pi} = 0.32\). This results in a p-value of \(0.6\). That is, if in fact 29% of Beta University undergraduates felt strongly that young adults needed to not live at home, then we would observe a value this different (32%) in our sample in about 60% of random samples collected in the same way.

Conclusion: Since our p-value is greater than the stated 10% Type I error, we would not reject our null hypothesis. We can thus conclude that there is not enough evidence to suggest that the percent of students that feel that young adults should not live at home is different at Beta University in 2019 than in the general public in 2002. Note that because the confidence interval [0.232, 0.422] contains the hypothesized value of 0.29, this is also an indication that there is insufficient evidence to overturn the null hypothesis. Importantly, this result only generalizes to all Beta University students if the sample was collected randomly. If it was collected based upon convenience, these results might not generalize. More information would be required to determine this.

13.4 Example: Estimate the relationship between two variables

Let’s return to the Lego dataset we explored in the beginning of this course. One of the authors of this book is both a huge Harry Potter fan and a Lego connoisseur. After buying a few of these sets, he began to wonder about the relationship between the number of minifigures in these sets and the price of the set. To answer this, he returned to the legosets data, focusing only on the subset relevant to this question.

Question: What is the relationship between the number of minifigures in a Harry Potter lego set and the price of the set?

Let's load in the data, create the relevant Harry Potter subset, and skim the relevant variables. Recall that USD_MSRP is our price variable in US dollars.

legosets <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSwB0bCE4w7qrep4lIC-QVyF307mxFNUYQ08LZqiFsC_ks1bt_ZEJqiTEo7SaCl6g4TQ8gig2ZIfQJu/pub?gid=0&single=true&output=csv")

legosets_HP <- legosets %>% 
  filter(Theme == "Harry Potter")

skim_with(numeric = list(hist = NULL))
legosets_HP %>% 
  select(USD_MSRP, Minifigures) %>% 
  skim()
Skim summary statistics
 n obs: 53 
 n variables: 2 

── Variable type:integer ──────────────────────────────────────────────────────────
    variable missing complete  n mean   sd p0 p25 p50 p75 p100
 Minifigures       2       51 53 3.71 2.62  1   2   3   4   12 

── Variable type:numeric ──────────────────────────────────────────────────────────
 variable missing complete  n  mean    sd p0 p25 p50   p75   p100
 USD_MSRP       0       53 53 34.36 33.86  0  10  20 49.99 149.99

Model: A question about ‘relationships’ in statistics typically means that we are interested in the slope in a regression, i.e.,

\[\widehat{USD\_MSRP} = b_0 + b_1*Minifigures\]

This model can be estimated in R using the following syntax:

lego_model <- lm(USD_MSRP ~ Minifigures, data = legosets_HP)

Note that we do not necessarily need to use a hypothesis test here to make a decision about anything; rather, we are interested in simply estimating the magnitude of the relationship between minifigures and price in the population of Harry Potter legosets. Therefore, we can simply construct a confidence interval, rather than conducting a hypothesis test.

Estimate: We can obtain our estimate \(b_1\) from our data using summary(lego_model), which provides us an estimate of this relationship.

summary(lego_model)

Call:
lm(formula = USD_MSRP ~ Minifigures, data = legosets_HP)

Residuals:
   Min     1Q Median     3Q    Max 
-50.87  -8.44   0.20   4.87  81.29 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -3.45       4.87   -0.71     0.48    
Minifigures    10.54       1.08    9.79    4e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.9 on 49 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.662, Adjusted R-squared:  0.655 
F-statistic: 95.9 on 1 and 49 DF,  p-value: 4.03e-13

Here the slope \(b_1 = 10.54\); this means that for every additional minifigure in a Harry Potter lego set, the average price increases by $10.54. How confident are we in this estimate? Is it precise? We can use the function confint() to construct a 95% confidence interval.

confint(lego_model)
             2.5 % 97.5 %
(Intercept) -13.24   6.33
Minifigures   8.38  12.70

Conclusion: We are 95% confident that the true relationship between minifugures and price, \(\beta_1\), is captured in the interval [8.38, 12.70]. Notably, this relationship cannot be generalized to all lego sets – only Harry Potter legosets.

13.5 Final thoughts

As we wrap up this course, take a moment to reflect on what you have learned. You now know how to explore, describe and visualize data. You also know some of the basic principles of statistical theory – the role of randomization and chance, the idea that your data is one version of hypothetically many other versions you could have, the fundamentals of how to use data to test and prove claims. And you know how to apply these principles to real data, to understand uncertainty, to determine when patterns are common or rare, and to test hypotheses.