
The function stratify() takes as input any data frame with observations (rows) that you wish to stratify into clusters. Typically, the goal of such stratification is developing a sampling design for maximizing generalizability. This function, and the others in this package, are designed to mimic the website


stratify(
  data = NULL,
  guided = TRUE,
  n_strata = NULL,
  variables = NULL,
  idvar = NULL,
  verbose = TRUE
)



data
data.frame object containing the population data to be stratified (observations as rows); must include a unique ID variable for each observation, as well as covariates.

guided
logical, defaults to TRUE. Whether the function should be guided (ask questions and behave interactively throughout) or not. If set to FALSE, the user must provide values for the other arguments below (n_strata, variables, and idvar).

n_strata
integer, defaults to NULL. If guided is set to FALSE, must provide the number of strata into which to divide the population.

variables
character, defaults to NULL. If guided is set to FALSE, must provide a character vector of the names of the stratifying variables (from the population data frame).

idvar
character, defaults to NULL. If guided is set to FALSE, must provide the name of the ID variable (from the population data frame) as a character string.

verbose
logical, defaults to TRUE.


The function returns a list of class "generalizeR_stratify" that can be provided as input to recruit(). More information on the components of this list can be found below under "Details."


The list contains 14 components: idvar, variables, dataset, n_strata, solution, pop_data_by_stratum, summary_stats, data_omitted, cont_data_stats, cat_data_levels, heat_data, heat_data_simple, heat_data_kable, and heat_plot.
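As a rough illustration of working with the returned list, the sketch below checks its class and compares its names against the 14 components listed above. It reuses the covariates and the internal `generalizeR:::inference_pop` data from this page's own example; the `expected_components` and `result` names are introduced here for illustration only, and the call is skipped if generalizeR is not installed.

```r
# Component names as listed in the Details text above.
expected_components <- c(
  "idvar", "variables", "dataset", "n_strata", "solution",
  "pop_data_by_stratum", "summary_stats", "data_omitted",
  "cont_data_stats", "cat_data_levels", "heat_data",
  "heat_data_simple", "heat_data_kable", "heat_plot"
)

# Only run the stratification if the package is available.
if (requireNamespace("generalizeR", quietly = TRUE)) {
  selection_covariates <- c("total", "pct_black_or_african_american",
                            "pct_white", "pct_female",
                            "pct_free_and_reduced_lunch")

  result <- generalizeR::stratify(generalizeR:::inference_pop,
                                  guided = FALSE, n_strata = 4,
                                  variables = selection_covariates,
                                  idvar = "ncessch")

  # Should be TRUE if the documented class is attached to the result.
  print(inherits(result, "generalizeR_stratify"))

  # Any documented component missing from the result is printed here.
  print(setdiff(expected_components, names(result)))
}
```

Individual components can then be pulled out by name with `$`, for example `result$n_strata`.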


a tibble with number of rows equal to the number of rows in the inference population (data) and number of columns equal to the number of stratifying variables (dummy-coded if applicable) plus the ID column (idvar) and a column representing stratum membership, Stratum
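The shape described above (one row per population unit; an ID column, the dummy-coded stratifying variables, and a Stratum column) can be sketched with base R on toy data. This is not the package's implementation: plain `kmeans()` stands in for whatever clustering `stratify()` performs, a data frame stands in for the tibble, the reference-level dummy coding is an assumption, and every name in the toy data (`pop`, `school_id`, `enrollment`, `locale`, `by_stratum`) is hypothetical.

```r
# Toy population: one row per unit, with an ID, one numeric covariate,
# and one categorical covariate. All names are made up for illustration.
set.seed(1)
pop <- data.frame(
  school_id  = sprintf("S%02d", 1:12),
  enrollment = c(120, 135, 110, 480, 510, 495, 900, 870, 940, 130, 500, 910),
  locale     = factor(rep(c("rural", "suburb", "urban"), times = 4))
)

# Dummy-code the categorical variable; drop the intercept column so only
# the covariate columns remain (assumed reference-level coding).
dummies <- model.matrix(~ enrollment + locale, data = pop)[, -1]

# Cluster the standardized covariates into 3 strata.
# kmeans() is a stand-in, not the algorithm used by stratify().
fit <- kmeans(scale(dummies), centers = 3, nstart = 10)

# Assemble the per-unit table: ID, covariates, stratum membership.
by_stratum <- data.frame(
  school_id = pop$school_id,
  dummies,
  Stratum = fit$cluster
)

# Rows match the population; columns are covariates plus ID and Stratum.
dim(by_stratum)
```

Here the row count equals `nrow(pop)` and the column count equals the number of dummy-coded covariate columns plus two, mirroring the description of the tibble.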


Tipton, E. (2014). Stratified sampling using cluster analysis: A sample selection strategy for improved generalizations from experiments. Evaluation Review, 37(2), 109-139.

Tipton, E. (2014). How generalizable is your experiment? An index for comparing experimental samples and populations. Journal of Educational and Behavioral Statistics, 39(6), 478-501.



selection_covariates <- c("total", "pct_black_or_african_american",
                          "pct_white", "pct_female", "pct_free_and_reduced_lunch")
stratify(generalizeR:::inference_pop, guided = FALSE, n_strata = 4,
         variables = selection_covariates, idvar = "ncessch")
#> This might take a little while. Please bear with us.
#> Calculated distance matrix.
#> iteration: 1 --> total WCSS: 338.562  -->  squared norm: 1.40551
#> iteration: 2 --> total WCSS: 205.024  -->  squared norm: 0.138412
#> iteration: 3 --> total WCSS: 203.903  -->  squared norm: 0.0419178
#> iteration: 4 --> total WCSS: 203.775  -->  squared norm: 0.0313237
#> iteration: 5 --> total WCSS: 203.729  -->  squared norm: 0
#> ===================== end of initialization 1 =====================