Panel Paper: Assessing Methods to Reduce the External Validity Bias Due to Purposive Site Selection

Saturday, November 10, 2012 : 9:30 AM
Calhoun (Sheraton Baltimore City Center Hotel)


Larry Orr (1), Robert Olsen (2), Stephen Bell (3) and Elizabeth Stuart (1); (1) Johns Hopkins University, (2) Rob Olsen LLC, (3) Westat

Many impact evaluations are carried out in a set of purposively selected sites, such as welfare offices, school districts, or Head Start centers. Conclusions from those evaluations are often then used to guide policy decisions for some population of interest, such as the students in a particular state. However, the sites in the evaluation are generally not selected randomly and may not be representative of that population. Olsen et al. (2012) demonstrate that purposive site selection can yield bias in the usual statistical sense: across infinite replications of the experiment, the average impact estimate will differ from the true average effect in the population. We refer to this difference as the external validity bias from purposive site selection.
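The definition above can be written compactly. The notation here is assumed for illustration and is not taken from Olsen et al. (2012): P is the population of N sites, S is the purposively selected subset, \Delta_j is the true impact in site j, and \hat{\Delta}_S is the impact estimate obtained from the sites in S. The population average is written as an unweighted mean over sites, which is also an assumption; a student-weighted average would follow the same pattern.

```latex
\[
\Delta_P = \frac{1}{N}\sum_{j \in P} \Delta_j,
\qquad
\mathrm{Bias}_{\mathrm{EV}} = \mathbb{E}\!\left[\hat{\Delta}_S\right] - \Delta_P
\]
```

The expectation is taken over replications of the experiment, so the bias is nonzero whenever the sites selected into S have systematically different impacts from the population as a whole.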

The magnitude of the external validity bias in actual studies is an open question. To estimate the bias, Bell et al. (2012) used methods similar to those used in “design replication” studies, which estimate the internal validity bias from non-experimental methods. Although that estimate is based on impact estimates for a single education program, its magnitude, roughly 0.10 standard deviations, is as large as the internal validity bias that would arise from a relatively naïve impact estimation model. This suggests that research on how to reduce the bias would be useful.

In this paper, we investigate methods for reducing the external validity bias due to purposive site selection. The methods we test are appropriate for evaluations that have already been conducted and are based on established procedures for handling selection bias and missing data in surveys and observational studies. These methods include post-stratification, regression, and propensity score-based strategies, where the “propensity score” in this case models the probability that a site is in the evaluation. We investigate how these approaches perform both when only a limited set of site characteristics is available for the adjustments and when a larger set of variables is available. Performance is expected to depend on the extent to which the variables associated with treatment effects and selection into the purposive sample are observed.
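As a rough illustration of the simplest of these adjustments, the sketch below post-stratifies site-level impact estimates on a single site characteristic. It is a minimal example under stated assumptions, not the procedure used in the paper: the function name, the field names ("locale", "impact"), and all numbers are hypothetical, and each site counts equally.

```python
from collections import defaultdict


def poststratified_impact(sample_sites, population_sites, stratum_key):
    """Reweight stratum-level mean impacts to population stratum shares.

    sample_sites: dicts with the stratum field and an "impact" estimate
                  (hypothetical field names, for illustration only).
    population_sites: dicts with the stratum field for every site in the
                      population of interest.
    """
    # Count how many population sites fall in each stratum.
    pop_counts = defaultdict(int)
    for site in population_sites:
        pop_counts[site[stratum_key]] += 1
    n_pop = len(population_sites)

    # Collect impact estimates from the evaluated sites, by stratum.
    impacts = defaultdict(list)
    for site in sample_sites:
        impacts[site[stratum_key]].append(site["impact"])

    # Post-stratification is undefined if a population stratum has no
    # evaluated sites; fail loudly rather than silently dropping it.
    missing = set(pop_counts) - set(impacts)
    if missing:
        raise ValueError(f"no evaluated sites in strata: {sorted(missing)}")

    # Weighted sum: population share of stratum times sample mean impact.
    return sum(
        (pop_counts[s] / n_pop) * (sum(impacts[s]) / len(impacts[s]))
        for s in pop_counts
    )


# Made-up data: urban sites are over-represented in the purposive sample.
population = [{"locale": "urban"}] * 2 + [{"locale": "rural"}] * 6
sample = [
    {"locale": "urban", "impact": 0.25},
    {"locale": "urban", "impact": 0.25},
    {"locale": "urban", "impact": 0.25},
    {"locale": "rural", "impact": 0.0},
]

unadjusted = sum(s["impact"] for s in sample) / len(sample)
adjusted = poststratified_impact(sample, population, "locale")
print(unadjusted, adjusted)
```

In this toy case the unadjusted mean is pulled toward the urban sites, while the adjusted estimate weights each stratum by its population share. The adjustment can only help to the extent that the stratifying variable is related to both site selection and treatment effects, which is the dependence on observed covariates that the paragraph above describes.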

For these investigations we take advantage of a unique data source collected by Abt Associates for an evaluation of Reading First and used to estimate external validity bias in Bell et al. (2012). The data include student-level longitudinal data for all school districts in 9 states. We have also collected lists of school districts that participated in 11 IES-funded impact evaluations, each of which selected a purposive sample. These data allow us to estimate impacts for all districts in the 9 states and for the subset of these districts that were included in one or more of the purposive samples. In this work we use the statistical methods listed above to adjust the impact estimates based on the purposively selected sites and assess how much closer the resulting impact estimates are to the true population impacts.
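One way to summarize "how much closer" an adjusted estimate is to the population impact is the share of absolute bias removed. The sketch below is only one possible summary, assumed here for illustration; the function name and the numbers are hypothetical and are not results from the study.

```python
def bias_reduction(population_impact, unadjusted, adjusted):
    """Return (bias before, bias after, share of absolute bias removed)."""
    bias_before = unadjusted - population_impact
    bias_after = adjusted - population_impact
    if bias_before == 0:
        raise ValueError("unadjusted estimate has no bias to reduce")
    # A share of 1.0 means the adjustment removed all of the bias;
    # a negative share means the adjustment made the estimate worse.
    share_removed = 1 - abs(bias_after) / abs(bias_before)
    return bias_before, bias_after, share_removed


# Made-up values in standard deviation units.
before, after, share = bias_reduction(
    population_impact=0.0, unadjusted=0.125, adjusted=0.03125
)
print(before, after, share)
```

Because the population impact is only known when data exist for every site, a comparison of this kind requires a data source that covers the full population, as the one described above does.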