# Chapter 11 P-values

In Chapter 10, we covered how to construct and interpret confidence intervals, which use the theory of repeated samples to make inferences from a sample (your data) to a population. To do so, we used counterfactual thinking that underpins statistical reasoning, wherein making inferences requires you to imagine alternative versions of your data that you might have under other possible samples selected in the same way. In this chapter, we extend this counterfactual reasoning to imagine other possible samples you might have seen if you knew the trend in the population. This way of thinking will lead us to define p-values.

### Needed Packages

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages.

library(tidyverse)
library(moderndive)
library(infer)
library(ggplot2movies)

## 11.1 Stochastic Proof by Contradiction

In many scientific pursuits, the goal is not simply to estimate a population parameter. Instead, the goal is often to understand if there is a difference between two groups in the population or if there is a relationship between two (or more) variables in the population. For example, we might want to know if average SAT scores differ between men and women, or if there is a relationship between education and income in the population in the United States.

Let’s take the difference in means between two groups as a motivating example. In order to prove that there is a difference between average SAT scores for men and women, we might proceed with what is in math called a proof by contradiction. Here, however, this proof is probabilistic (aka stochastic).

There are three steps in a Proof by Contradiction. In order to illustrate these, assume we wish to prove that there is a relationship between X and Y.

1. Negate the conclusion: Begin by assuming the opposite – that there is no relationship between X and Y.
2. Analyze the consequences of this premise: If there is no relationship between X and Y in the population, what would the sampling distribution of the estimate of the relationship between X and Y look like?
3. Look for a contradiction: Compare the relationship between X and Y observed in your sample to this sampling distribution. How (un)likely is this observed relationship?

If the likelihood of the observed relationship is small (given your assumption of no relationship), then this is evidence that there is in fact a relationship between X and Y in the population.
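The three steps above can be sketched with a small simulation. This is only an illustration under stated assumptions, not the method used in the rest of this chapter: the scores are simulated (hypothetical) data for two groups, and the null distribution is approximated by shuffling group labels (a permutation approach) rather than derived from mathematical theory. All object names are made up for the example.

```r
# Hypothetical data: test scores for two groups, simulated for illustration only
set.seed(11)
n_per_group <- 50
scores <- c(rnorm(n_per_group, mean = 500, sd = 100),
            rnorm(n_per_group, mean = 540, sd = 100))
group <- rep(c("A", "B"), each = n_per_group)

# Difference in sample means (group B minus group A)
diff_means <- function(y, g) mean(y[g == "B"]) - mean(y[g == "A"])
obs_diff <- diff_means(scores, group)

# Step 1: assume there is no relationship between group and score.
# Step 2: under that assumption the labels are interchangeable, so shuffling
#         them many times approximates the sampling distribution of the
#         difference in means.
null_diffs <- replicate(5000, diff_means(scores, sample(group)))

# Step 3: how often is a shuffled difference at least as extreme as the
#         observed one (in either direction)?
p_value <- mean(abs(null_diffs) >= abs(obs_diff))
p_value
```

A small proportion here would play the role of the "contradiction": differences as large as the observed one would rarely arise if the groups truly did not differ.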

## 11.2 Repeated samples, the null hypothesis, and p-values

### 11.2.1 Null hypothesis

In the example of asking if there is a difference in SAT scores between men and women, you will note that in order to prove that there is a difference, we begin by assuming that there is not a difference (Step 1). We call this the null hypothesis – it is the hypothesis we are attempting to disprove. The most common null hypotheses are:

• A parameter is 0 in the population (e.g. some treatment effect $$\theta = 0$$)
• There is no difference between two or more groups in the population (e.g. $$\mu_1 - \mu_2 = 0$$)
• There is no relationship between two variables in the population (e.g. $$\beta_1 = 0$$)
• The population parameter is equal to some norm known or assumed by previous data or literature (e.g. $$\pi = 0.5$$ or $$\mu = \mu_{norm}$$)

Importantly, this hypothesis is about the value or relationship in the population, not the sample. (This is a very easy mistake to make). Remember, you have data in your sample, so you know without a doubt if there is a difference or relationship in your data (that is your estimate). What you do not know is if there is a difference or relationship in the population. Once a null hypothesis is determined, the next step is to determine what the sampling distribution of the estimator would be if this null hypothesis were true (Step 2). We can determine what this null distribution would look like, just as we've done with sampling distributions more generally: using mathematical theory and formulas for known distributions.
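To make the idea of a null distribution concrete, here is a rough sketch that draws many samples from a population in which the null hypothesis is true and compares the simulated standardized means with the t-distribution that theory predicts. The population (normal, with mean `mu_null` and standard deviation `sigma`) and the sample size are assumptions chosen only for illustration.

```r
# Hypothetical population in which the null hypothesis (mu = mu_null) is true
set.seed(39)
mu_null <- 0
sigma <- 1
n <- 25

# Draw many samples and standardize each sample mean using the null value
t_stats <- replicate(10000, {
  x <- rnorm(n, mean = mu_null, sd = sigma)
  (mean(x) - mu_null) / (sd(x) / sqrt(n))
})

# Compare simulated quantiles with those of a t-distribution with n - 1 df
probs <- c(0.025, 0.5, 0.975)
rbind(simulated = quantile(t_stats, probs),
      theory    = qt(probs, df = n - 1))
```

If the two rows are close, the formula-based distribution is doing the same job as the repeated sampling, just without the simulation.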

### 11.2.2 P-values

Once the distribution of the sample statistic under the null hypothesis is determined, to complete the stochastic proof by contradiction, you simply need to ask: Given this distribution, how likely is it that I would have drawn a random sample in which the estimated value is this extreme or more extreme?

This is the p-value: the probability of observing an estimate as extreme as, or more extreme than, the one you observed if the null hypothesis is true. If this p-value is small, it means that data like yours are unlikely to occur under the null hypothesis, which counts as evidence against the null hypothesis. (See, proof by contradiction!)

In general, in order to estimate a p-value, you first need to standardize your sample statistic. This standardization makes it easier to determine the sampling distribution under the null hypothesis.

Standardization is conducted using the following formula:

$t\_stat = \frac{Estimate - Null \ \ value}{SE(Estimate)}$

Note that this is just a special case of the standardization formula we've seen before, where here we're plugging in the "null value" for the mean of the estimate. The null value refers to the value of the population parameter assumed by the null hypothesis. As we mentioned, in many cases the null value is zero. That is, we begin the proof by contradiction by assuming there is no relationship, no differences between groups, etc. in the population.

This standardized statistic $$t\_stat$$ is then used to determine the sampling distribution under the null hypothesis and the p-value based upon the observed value.
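As a minimal sketch of the standardization formula, the helper below computes the standardized statistic and a p-value from it. The function name `standardize_and_test()` is hypothetical (it is not from any package), the inputs are made-up numbers, and the sketch assumes a two-sided p-value with a t-distribution as the null distribution.

```r
# Hypothetical helper, not part of any package
standardize_and_test <- function(estimate, null_value, se, df) {
  # t_stat = (Estimate - Null value) / SE(Estimate)
  t_stat <- (estimate - null_value) / se
  # Two-sided p-value: probability of a statistic at least this far from 0
  # in either direction, assuming a t-distribution with `df` degrees of freedom
  p_value <- 2 * pt(-abs(t_stat), df = df)
  c(t_stat = t_stat, p_value = p_value)
}

# Made-up inputs for illustration
res <- standardize_and_test(estimate = 2.5, null_value = 0, se = 1.25, df = 30)
res
```

Using `-abs(t_stat)` means the same code works whether the estimate falls above or below the null value.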

## 11.3 P-value and Null Distribution Example

### 11.3.1 IMDB data

The movies dataset in the ggplot2movies package contains information on 58,788 movies that have been rated by users of IMDB.com.

movies
# A tibble: 58,788 x 24
title  year length budget rating votes    r1    r2    r3    r4    r5    r6
<chr> <int>  <int>  <int>  <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 $      1971    121     NA    6.4   348   4.5   4.5   4.5   4.5  14.5  24.5
2 $100…  1939     71     NA    6      20   0    14.5   4.5  24.5  14.5  14.5
3 $21 …  1941      7     NA    8.2     5   0     0     0     0     0    24.5
4 $40,…  1996     70     NA    8.2     6  14.5   0     0     0     0     0
5 $50,…  1975     71     NA    3.4    17  24.5   4.5   0    14.5  14.5   4.5
6 $pent  2000     91     NA    4.3    45   4.5   4.5   4.5  14.5  14.5  14.5

$duration <dbl> 16.3, 13.5, 21.1, 22.4, 18.9, 25.1, 38.1, 21.8, 26.1, 15.6, …

We want to know whether or not $$\mu_{\text{B}} = \$19.50$$, so we estimate $$\mu_{\text{B}}$$ by $$\bar{x}_{\text{B}}$$. We find that in our sample, the average company B ride price is $$\$20.30$$.

rides_B %>% summarize(xbar = mean(price))
# A tibble: 1 x 1
xbar
<dbl>
1  20.3

This is higher than the A population average of $$\$19.50$$, but is this indicative of a true difference in average prices, or is this just the result of sampling variation and the fact we're only observing 100 data points? Let's compute a t-statistic and p-value to help answer this question.

### 11.4.1 Using formulas

Note, we begin by assuming that company B also has a population average of \$19.50, and we will examine whether our data seem to be consistent with that null hypothesis. That is, we start with the null hypothesis that $$\mu_{\text{B}} = \$19.50$$. Note that this is a null hypothesis of the type $$\mu = \mu_{norm}$$, where we have a specific null value we want to compare our sample to. If this null hypothesis is true, we expect the t-statistic $$\frac{\bar{x} - 19.50}{\frac{s}{\sqrt{n}}}$$ to follow a t-distribution with $$df = n - 1 = 99$$. Let's compute this t-statistic for the values in our sample and compare it to this known sampling distribution.

rides_B %>%
summarize(xbar = mean(price),
s = sd(price),
n = n(),
SE = s/sqrt(n),
t_stat = (xbar - 19.50)/SE)
# A tibble: 1 x 5
xbar     s     n    SE t_stat
<dbl> <dbl> <int> <dbl>  <dbl>
1  20.3  5.18   100 0.518   1.53

By looking at Figure 11.3 and computing the p-value using pt(), we see that if company B does in fact have the same true population average price as company A (i.e. if $$\mu_{\text{B}} = \$19.50$$), we would expect to observe a sample average at least as far from \$19.50 as the one we did (i.e. $$\bar{x}_{\text{B}} = 20.3$$) about 13% of the time. The factor of 2 in the calculation makes this a two-sided p-value: it counts sample averages that far above or that far below \$19.50.

2*pt(-1.53, df = 99)
[1] 0.129
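To see where the factor of 2 comes from, the short sketch below computes the upper-tail probability and the two-sided p-value separately, using the t-statistic and degrees of freedom from this example. The object names are chosen just for illustration.

```r
t_stat <- 1.53  # t-statistic computed above
df <- 99        # n - 1

# Probability of a t-statistic at least this large (upper tail only)
p_upper <- pt(t_stat, df = df, lower.tail = FALSE)

# Two-sided p-value: at least this far from 0 in either direction
p_two_sided <- 2 * pt(-abs(t_stat), df = df)

c(upper_tail = p_upper, two_sided = p_two_sided)
```

Because the t-distribution is symmetric, the two-sided value is exactly twice the upper-tail value.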

### 11.4.2 Using t.test

Let's compute the same information using t.test. Note in this case, we are only concerned with one mean (rather than a difference in two group means), so we only need to specify x. There is a default argument in t.test that sets the null value mu = 0, which we need to change to mu = 19.5. Let's run t.test on this data and examine the results.

rides_t.test <- t.test(rides_B$price, mu = 19.5)
rides_t.test

One Sample t-test

data:  rides_B$price
t = 2, df = 99, p-value = 0.1
alternative hypothesis: true mean is not equal to 19.5
95 percent confidence interval:
19.3 21.3
sample estimates:
mean of x
20.3 
rides_t.test$stderr
[1] 0.518
rides_t.test$statistic
   t
1.53 
rides_t.test$p.value
[1] 0.13

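t.test gives all the same values we saw when using the formulas to calculate these quantities "by hand." (The printed summary shows the t-statistic and p-value rounded to t = 2 and p-value = 0.1; the extracted components show the more precise 1.53 and 0.13.)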
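The agreement between the formulas and t.test can be checked directly. Since the rides data are not reproduced here, this sketch uses simulated prices (`price_sim` is a hypothetical stand-in, not the chapter's data) and compares the hand-computed quantities with the components returned by t.test.

```r
# Hypothetical stand-in for the ride prices, simulated for illustration only
set.seed(107)
price_sim <- rnorm(100, mean = 20, sd = 5)
mu_null <- 19.5

# "By hand" using the formulas
n <- length(price_sim)
se_manual <- sd(price_sim) / sqrt(n)
t_manual <- (mean(price_sim) - mu_null) / se_manual
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# Using t.test with the same null value
tt <- t.test(price_sim, mu = mu_null)

# Each comparison should return TRUE
all.equal(unname(tt$statistic), t_manual)
all.equal(tt$stderr, se_manual)
all.equal(tt$p.value, p_manual)
```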

### 11.4.3 Using regression

We can use a regression model with an intercept only to estimate the mean of a single variable. In this case, our model would be $$\widehat{price} = b_0$$. In order to specify this in R, we simply put a 1 on the right hand side of the tilde instead of specifying any predictor variables. Let's look at the results of this model.

ride_model <- lm(price ~ 1, data = rides_B)
summary(ride_model)

Call:
lm(formula = price ~ 1, data = rides_B)

Residuals:
Min      1Q  Median      3Q     Max
-12.878  -3.789   0.315   3.934  10.873

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   20.292      0.518    39.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.18 on 99 degrees of freedom

We see that this model gives the correct estimate $$\bar{x}_B = 20.3$$, standard error $$SE(\bar{x}_B) = 0.518$$, and degrees of freedom $$n - 1 = 99$$. But what's going on with the very large t-value of 39.1? In a regression framework, the model always assumes the null value is zero and therefore the t_stat is computed as $$\frac{Estimate - 0}{SE(Estimate)}$$. We could use the model output for $$\bar{x}, SE(\bar{x})$$, and $$df$$ to compute the correct t-value and p-value ourselves by subtracting off the null value of 19.5, similar to when we did the calculations via formulas. Alternatively, we could get the regression model to report the correct t-value and p-value by first centering our variable around the null value.

rides_B <- rides_B %>%
mutate(price_centered = price - 19.5)

ride_model_2 <- lm(price_centered ~ 1, data = rides_B)
summary(ride_model_2)

Call:
lm(formula = price_centered ~ 1, data = rides_B)

Residuals:
Min      1Q  Median      3Q     Max
-12.878  -3.789   0.315   3.934  10.873

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.792      0.518    1.53     0.13

Residual standard error: 5.18 on 99 degrees of freedom

This results in the correct t-value of 1.53 and a p-value of 0.13. Note that the Estimate column is now reporting $$\bar{x}_B - 19.5 = 20.292 - 19.5 = 0.792$$.
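The other route mentioned above, computing the t-value and p-value ourselves from the output of the uncentered model, might look like the following sketch. It uses simulated prices in a hypothetical data frame `sim` rather than the chapter's data, and checks that the result matches a model fit to the centered variable.

```r
# Hypothetical stand-in data, simulated for illustration only
set.seed(152)
sim <- data.frame(price = rnorm(100, mean = 20, sd = 5))
mu_null <- 19.5

# Intercept-only model: the intercept estimate is the sample mean
fit <- lm(price ~ 1, data = sim)
coefs <- coef(summary(fit))
est <- coefs["(Intercept)", "Estimate"]
se <- coefs["(Intercept)", "Std. Error"]
df <- df.residual(fit)

# Subtract the null value ourselves instead of using the default of 0
t_stat <- (est - mu_null) / se
p_value <- 2 * pt(-abs(t_stat), df = df)
c(t_stat = t_stat, p_value = p_value)

# Same answer from a model fit to the centered variable
fit_centered <- lm(I(price - mu_null) ~ 1, data = sim)
coef(summary(fit_centered))["(Intercept)", c("t value", "Pr(>|t|)")]
```

Centering changes only the estimate; the standard error and degrees of freedom are the same in both models, which is why the two approaches agree.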

## 11.5 Interpretation of P-values

Like many statistical concepts, p-values are often misunderstood and misinterpreted. Remember, a p-value is the probability that you would observe data at least as extreme as the data you actually observed if, in fact, the null hypothesis is true. As Wikipedia notes:

• The p-value is not the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.
• The p-value is not the probability that the observed effects were produced by random chance alone.
• The p-value does not indicate the size or importance of the observed effect.

Finally, remember that the p-value is a probabilistic attempt at making a proof by contradiction. Unlike in math, this is not a definitive proof. For example, if the p-value is 0.10, this means that if the null hypothesis is true, there is a 10% chance that you would observe an effect at least as large as the one in your sample. Depending on whether you are a glass-half-empty or glass-half-full kind of person, this could be seen as large or small:

• “Only 10% chance is small, which is unlikely. This must mean that the null hypothesis is not true,” or
• “But we don’t know that for sure: in 10% of possible samples, this does occur just by chance. The null hypothesis could be true.”

This will be important to keep in mind as we move towards using p-values for decision making in Chapter 12.
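The 10% figure in the example above can be illustrated with a simulation. The sketch below repeatedly samples from a hypothetical population in which the null hypothesis really is true (a normal population with mean 19.5 and standard deviation 5, both assumed for illustration) and records how often the p-value comes out at or below 0.10.

```r
# Hypothetical population in which the null hypothesis is true
set.seed(167)
mu_null <- 19.5

# Repeat the study many times and keep each p-value
p_values <- replicate(5000, {
  x <- rnorm(100, mean = mu_null, sd = 5)
  t.test(x, mu = mu_null)$p.value
})

# Proportion of samples with a p-value of 0.10 or smaller
mean(p_values <= 0.10)
```

The proportion should land near 0.10: even when the null hypothesis is true, roughly one sample in ten produces a p-value this small just by chance.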