The function `stratify()` takes as input any data frame with observations (rows) that you wish to stratify into clusters. Typically, the goal of such stratification is developing a sampling design for maximizing generalizability. This function, and the others in this package, are designed to mimic the website https://www.thegeneralizer.org/.

## Usage

```
stratify(
  data = NULL,
  guided = TRUE,
  n_strata = NULL,
  variables = NULL,
  idvar = NULL,
  verbose = TRUE
)
```

## Arguments

- data
data.frame object containing the population data to be stratified (observations as rows); must include a unique id variable for each observation, as well as covariates.

- guided
logical, defaults to TRUE. Whether the function should be guided (ask questions and behave interactively throughout) or not. If set to FALSE, the user must provide values for the other arguments below.

- n_strata
integer, defaults to NULL. If guided is set to FALSE, must provide the number of strata into which to cluster the population.

- variables
character, defaults to NULL. If guided is set to FALSE, must provide a character vector of the names of the stratifying variables (from the population data frame).

- idvar
character, defaults to NULL. If guided is set to FALSE, must provide a character vector containing the name of the ID variable (from the population data frame).

- verbose
logical, defaults to TRUE.

## Value

The function returns a list of class "generalizeR_stratify" that can be provided as input to `recruit()`. More information on the components of this list can be found below under "Details."

## Details

The list contains 14 components: `idvar`, `variables`, `dataset`, `n_strata`, `solution`, `pop_data_by_stratum`, `summary_stats`, `data_omitted`, `cont_data_stats`, `cat_data_levels`, `heat_data`, `heat_data_simple`, `heat_data_kable`, and `heat_plot`.

`pop_data_by_stratum`: a tibble with number of rows equal to the number of rows in the inference population (`data`) and number of columns equal to the number of stratifying variables (dummy-coded if applicable) plus the ID column (`idvar`) and a column representing stratum membership, `Stratum`.
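
As a minimal sketch of how the returned list might be inspected, the snippet below stores the result of a non-guided call and looks at its class, its component names, and the `Stratum` column of `pop_data_by_stratum`. It assumes the generalizeR package is installed and reuses the call shown under "Examples"; the object name `strat` is a hypothetical choice for illustration only.

```r
library(generalizeR)

# Covariates used to define the strata (same as in the Examples section)
selection_covariates <- c("total", "pct_black_or_african_american",
                          "pct_white", "pct_female", "pct_free_and_reduced_lunch")

# `strat` is a hypothetical name for the returned list
strat <- stratify(generalizeR:::inference_pop, guided = FALSE, n_strata = 4,
                  variables = selection_covariates, idvar = "ncessch")

# The result should carry the "generalizeR_stratify" class
class(strat)

# Names of the components described above
names(strat)

# Number of population units assigned to each stratum
table(strat$pop_data_by_stratum$Stratum)
```

Storing the result rather than printing it keeps the object available for later steps, such as passing it to `recruit()`.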

## References

Tipton, E. (2014). Stratified sampling using cluster analysis: A sample selection strategy for improved generalizations from experiments. *Evaluation Review*, *37*(2), 109-139.

Tipton, E. (2014). How generalizable is your experiment? An index for comparing experimental samples and populations. *Journal of Educational and Behavioral Statistics*, *39*(6), 478-501.

## Examples

```
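library(tidyverse)
library(generalizeR)

# Covariates used to define the strata
selection_covariates <- c("total", "pct_black_or_african_american",
                          "pct_white", "pct_female", "pct_free_and_reduced_lunch")

# Non-guided call: n_strata, variables, and idvar must all be supplied
stratify(generalizeR:::inference_pop, guided = FALSE, n_strata = 4,
         variables = selection_covariates, idvar = "ncessch")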
#>
#> This might take a little while. Please bear with us.
#>
#> Calculated distance matrix.
#>
#> iteration: 1 --> total WCSS: 338.562 --> squared norm: 1.40551
#> iteration: 2 --> total WCSS: 205.024 --> squared norm: 0.138412
#> iteration: 3 --> total WCSS: 203.903 --> squared norm: 0.0419178
#> iteration: 4 --> total WCSS: 203.775 --> squared norm: 0.0313237
#> iteration: 5 --> total WCSS: 203.729 --> squared norm: 0
#>
#> ===================== end of initialization 1 =====================
#>
```