Panel Paper:
Assessing Statistical Methods for Estimating Population Average Treatment Effects from Purposive Samples in Education
In particular, while non-experimental studies may suffer from internal validity bias (that is, they can produce biased estimates of the intervention's effects for the study sample), randomized controlled trials (RCTs) may suffer from external validity bias when estimating effects in target populations of interest. Randomized trials are almost always conducted in non-randomly selected samples of subjects who volunteered, rather than being required, to participate. When evaluations obtain participants in this manner, they produce impact estimates with external validity bias if impacts vary across subjects and are correlated with a subject's probability of inclusion in the study (Olsen et al., 2013). Recent research has found empirical evidence that this bias can be non-trivial (Allcott, 2015; Bell et al., 2011).
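The mechanism can be illustrated with a small simulation. All numbers and the selection model below are invented for this sketch (they are not drawn from the paper or the cited studies); the point is only that when the probability of joining a study rises with a subject's individual impact, the study sample's average effect overstates the population average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 10,000 subjects with heterogeneous
# treatment effects (values chosen arbitrarily for illustration).
n = 10_000
effect = rng.normal(loc=0.20, scale=0.10, size=n)  # true individual impacts

# The bias condition: inclusion probability is positively correlated
# with the individual impact.
p_include = np.clip(0.05 + 1.0 * (effect - effect.mean()), 0.01, 0.99)
in_rct = rng.random(n) < p_include

pate = effect.mean()                   # population average treatment effect
rct_estimate = effect[in_rct].mean()   # average effect in the volunteer sample

print(f"PATE: {pate:.3f}, RCT sample mean effect: {rct_estimate:.3f}")
```

Because high-impact subjects are over-represented among volunteers, the RCT sample's mean effect exceeds the population average; if inclusion were uncorrelated with impact, the two would coincide in expectation.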
Recently proposed statistical methods aim to bridge this gap by combining data from an RCT with data on a target population of interest to estimate population treatment effects (e.g., Kern, Stuart, Hill, and Green, under review; Tipton, 2013). These methods fall into two broad classes: (1) flexible regression models of the outcome as a function of treatment status and covariates, and (2) reweighting methods that weight the RCT sample to reflect the covariate distribution in the population. However, there has been little formal investigation of how well, or under what conditions, these methods work.
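The reweighting idea can be sketched in a toy setting invented for illustration: suppose a single observed binary covariate both moderates the treatment effect and drives selection into the RCT. Post-stratifying the RCT sample to the population's covariate distribution then removes the bias (this is a minimal sketch of the general class, not the specific estimators studied in the paper).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: one binary moderator x (e.g., urban vs. rural).
# The true effect differs by x, and the RCT over-samples x == 1.
n = 20_000
x = rng.random(n) < 0.30                           # 30% of population has x = 1
effect = np.where(x, 0.30, 0.10)                   # effect moderated by x
in_rct = rng.random(n) < np.where(x, 0.60, 0.10)   # selection depends on x

pate = effect.mean()            # population average treatment effect (~0.16)

# The unweighted RCT estimate is pulled toward the over-sampled stratum.
naive = effect[in_rct].mean()

# Post-stratification: weight each RCT subject by the ratio of its
# stratum's population share to its stratum's RCT-sample share.
pop_share = np.array([1 - x.mean(), x.mean()])
rct_share = np.array([1 - x[in_rct].mean(), x[in_rct].mean()])
w = (pop_share / rct_share)[x[in_rct].astype(int)]
reweighted = np.average(effect[in_rct], weights=w)

print(f"PATE: {pate:.3f}, naive RCT: {naive:.3f}, reweighted: {reweighted:.3f}")
```

The reweighted estimate recovers the population effect here only because the sole moderator is observed; as the paper's simulations emphasize, an unobserved moderator that also drives selection would leave the weighted estimate biased.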
This paper presents results from simulation studies examining the performance of methods in each of these two broad classes. The simulations are designed to be as realistic as possible, drawing on data from a nationally representative sample of public school students, empirical evidence on impact variation from two large-scale RCTs in education, and evidence on the types of schools selected for several RCTs in education. We find that each approach works well when its underlying assumptions are satisfied. However, when key assumptions are violated (for example, when we do not observe all of the factors that moderate treatment effects and differ between the RCT sample and the target population), none of the methods consistently estimates the population effects. We conclude with recommendations for practice, including the need for thorough and consistent covariate measurement and a better understanding of treatment effect heterogeneity. This work helps to identify the conditions under which different statistical methods can reduce external validity bias in educational evaluations.