Stratify a Population Data Frame — stratify • generalizeR

The function stratify() takes as input any data frame with observations (rows) that you wish to stratify into clusters. Typically, the goal of such stratification is developing a sampling design for maximizing generalizability. This function, and the others in this package, are designed to mimic the website https://www.thegeneralizer.org/.

Usage

stratify(
  data = NULL,
  guided = TRUE,
  n_strata = NULL,
  variables = NULL,
  idvar = NULL,
  verbose = TRUE
)

Arguments

data: data.frame object containing the population data to be stratified (observations as rows); must include a unique id variable for each observation, as well as covariates.
guided: logical, defaults to TRUE. Whether the function should run in guided mode (asking questions and behaving interactively throughout) or not. If set to FALSE, the user must provide values for the other arguments below.
n_strata: integer, defaults to NULL. If guided is set to FALSE, must provide the number of strata into which the population will be clustered.
variables: character, defaults to NULL. If guided is set to FALSE, must provide a character vector of the names of the stratifying variables (from the population data frame).
idvar: character, defaults to NULL. If guided is set to FALSE, must provide the name of the ID variable (from the population data frame) as a character vector.
verbose: logical, defaults to TRUE.
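
Because a non-guided call depends on n_strata, variables, and idvar all being supplied and matching the population data frame, it can help to check them before calling stratify(). The sketch below uses base R only. The helper check_stratify_inputs() is hypothetical and not part of generalizeR, and the rules it applies (a positive whole number of strata, column names present in the data, unique IDs) are assumptions drawn from the argument descriptions above, not the package's own validation.

```r
# Hypothetical helper, NOT part of generalizeR: collects problems with the
# inputs a non-guided stratify() call needs. The rules are assumptions based
# on the argument descriptions, not the package's internal checks.
check_stratify_inputs <- function(data, n_strata, variables, idvar) {
  stopifnot(is.data.frame(data))
  problems <- character(0)

  # n_strata: a single positive whole number
  if (!is.numeric(n_strata) || length(n_strata) != 1 || is.na(n_strata) ||
      n_strata < 1 || n_strata != round(n_strata)) {
    problems <- c(problems, "n_strata must be a single positive whole number")
  }

  # variables: a non-empty character vector of column names
  if (!is.character(variables) || length(variables) == 0) {
    problems <- c(problems, "variables must be a non-empty character vector")
  }

  # idvar: a single column name
  if (!is.character(idvar) || length(idvar) != 1) {
    problems <- c(problems, "idvar must be a single column name")
  }

  # Every named column has to exist in the data frame
  missing_cols <- setdiff(c(variables, idvar), names(data))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("columns not found in data:",
                                  paste(missing_cols, collapse = ", ")))
  }

  # The ID column should identify each observation uniquely
  if (is.character(idvar) && length(idvar) == 1 && idvar %in% names(data) &&
      anyDuplicated(data[[idvar]]) > 0) {
    problems <- c(problems, "idvar does not uniquely identify rows")
  }

  problems
}

# Toy population (made-up values, for illustration only)
toy <- data.frame(
  school_id  = c("a", "b", "c", "d"),
  total      = c(100, 200, 300, 400),
  pct_female = c(0.48, 0.51, 0.50, 0.53)
)

ok_result  <- check_stratify_inputs(toy, 2, c("total", "pct_female"), "school_id")
bad_result <- check_stratify_inputs(toy, 2, c("total", "not_a_column"), "school_id")
```

An empty result means the inputs passed these checks. Each element of a non-empty result names one issue to fix before the call.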

Value

The function returns a list of class "generalizeR_stratify" that can be provided as input to recruit(). More information on the components of this list can be found below under "Details."

Details

The list contains 14 components: idvar, variables, dataset, n_strata, solution, pop_data_by_stratum, summary_stats, data_omitted, cont_data_stats, cat_data_levels, heat_data, heat_data_simple, heat_data_kable, and heat_plot.

pop_data_by_stratum: a tibble with number of rows equal to the number of rows in the inference population (data) and number of columns equal to the number of stratifying variables (dummy-coded if applicable) plus the ID column (idvar) and a column representing stratum membership, Stratum.
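
To make the shape of pop_data_by_stratum concrete, the sketch below builds a table with the same layout from a made-up population: one row per observation, the ID column, the stratifying variables (with the categorical one dummy-coded), and a Stratum column. It uses base R's model.matrix() and kmeans() on scaled columns as a stand-in. That is an assumption for illustration and not necessarily how stratify() codes variables or forms clusters. The data, the column names, and the choice of three strata are all hypothetical, and the result is a plain data frame rather than a tibble.

```r
set.seed(1)

# Made-up inference population: 12 schools, one ID column, three covariates
toy_pop <- data.frame(
  school_id  = sprintf("S%02d", 1:12),
  total      = c(100, 120, 110, 130, 500, 520, 510, 530, 900, 920, 910, 930),
  pct_female = c(0.48, 0.52, 0.50, 0.49, 0.51, 0.47, 0.53, 0.50,
                 0.46, 0.54, 0.49, 0.51),
  urban      = factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes",
                        "no", "yes", "no", "yes"))
)

idvar     <- "school_id"
variables <- c("total", "pct_female", "urban")

# Dummy-code the categorical variable; numeric columns pass through unchanged
dummies <- model.matrix(~ . - 1, data = toy_pop[variables])

# Stand-in clustering: k-means on standardized columns (an assumption, not
# the package's own procedure)
fit <- kmeans(scale(dummies), centers = 3, nstart = 10)

# Assemble ID column + stratifying variables + stratum membership
toy_by_stratum <- data.frame(
  school_id = toy_pop[[idvar]],
  dummies,
  Stratum   = fit$cluster
)

# One row per observation; columns = dummy-coded variables + ID + Stratum
dim(toy_by_stratum)
table(toy_by_stratum$Stratum)
```

The row count matches the population, and the column count is the number of dummy-coded stratifying columns plus two, which mirrors the description of pop_data_by_stratum above.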

References

Tipton, E. (2014). Stratified sampling using cluster analysis: A sample selection strategy for improved generalizations from experiments. Evaluation Review, 37(2), 109-139.

Tipton, E. (2014). How generalizable is your experiment? An index for comparing experimental samples and populations. Journal of Educational and Behavioral Statistics, 39(6), 478-501.

Examples

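library(generalizeR)  # provides stratify()
library(tidyverse)

# Names of the stratifying variables in the inference population
selection_covariates <- c("total", "pct_black_or_african_american",
                          "pct_white", "pct_female", "pct_free_and_reduced_lunch")

# Non-guided call: n_strata, variables, and idvar must all be supplied
stratify(generalizeR:::inference_pop, guided = FALSE, n_strata = 4,
         variables = selection_covariates, idvar = "ncessch")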
#> 
#> This might take a little while. Please bear with us.
#> 
#> Calculated distance matrix.
#>  
#> iteration: 1 --> total WCSS: 338.562  -->  squared norm: 1.40551
#> iteration: 2 --> total WCSS: 205.024  -->  squared norm: 0.138412
#> iteration: 3 --> total WCSS: 203.903  -->  squared norm: 0.0419178
#> iteration: 4 --> total WCSS: 203.775  -->  squared norm: 0.0313237
#> iteration: 5 --> total WCSS: 203.729  -->  squared norm: 0
#>  
#> ===================== end of initialization 1 =====================
#>