The magnitude of the external validity bias in actual studies is an open question. To estimate the bias, Bell et al. (2012) used methods similar to those used in “design replication” studies, which estimate the internal validity bias from non-experimental methods. Although it is based on impact estimates for a single education program, the estimated external validity bias, roughly 0.10 standard deviations, is as large as the internal validity bias that would arise from a relatively naïve impact estimation model. This suggests that research on how to reduce the bias would be useful.
In this paper, we investigate methods for reducing the external validity bias due to purposive site selection. The methods we test are appropriate for evaluations that have already been conducted and are based on established procedures for handling selection bias and missing data in surveys and observational studies. These methods include post-stratification, regression, and propensity score-based strategies, where the “propensity score” in this case models the probability of being in the evaluation. We investigate the performance of these approaches both when only a limited set of site characteristics is available for the adjustments and when a larger set of variables is available. The performance of the methods is expected to depend on the extent to which the variables associated with treatment effects and with selection into the purposive sample are observed.
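As an illustration of the simplest of these adjustments, the following is a minimal sketch of post-stratification. It is not the procedure used in the paper. The function name, the stratum labels, and all numbers are hypothetical. It assumes each site belongs to exactly one stratum defined by observed site characteristics and that the number of population sites in each stratum is known. Site-level impacts from the purposive sample are averaged within each stratum and then reweighted by the population share of that stratum.

```python
from collections import defaultdict


def post_stratified_impact(sample_sites, population_counts):
    """Reweight stratum-level mean impacts to population stratum shares.

    sample_sites: list of (stratum, impact) pairs for the purposively
        selected sites (hypothetical input format).
    population_counts: dict mapping stratum -> number of sites in the
        population of interest.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stratum, impact in sample_sites:
        sums[stratum] += impact
        counts[stratum] += 1

    # Post-stratification cannot adjust for a stratum with no sampled sites.
    missing = set(population_counts) - set(counts)
    if missing:
        raise ValueError(f"no sampled sites in strata: {sorted(missing)}")

    total = sum(population_counts.values())
    return sum(
        (population_counts[s] / total) * (sums[s] / counts[s])
        for s in population_counts
    )


# Made-up example: urban sites are over-represented in the sample.
sample = [("urban", 0.30), ("urban", 0.20), ("urban", 0.25), ("rural", 0.05)]
population = {"urban": 20, "rural": 80}

unadjusted = sum(impact for _, impact in sample) / len(sample)
adjusted = post_stratified_impact(sample, population)
print(f"unadjusted: {unadjusted:.2f}, post-stratified: {adjusted:.2f}")
```

In this made-up example the unadjusted sample mean overstates the population-weighted impact because the over-represented stratum has larger impacts. The adjustment helps only to the extent that the stratifying variables capture the characteristics that drive both selection and impact variation, which is the dependence on observed variables noted above.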
For these investigations, we take advantage of a unique data source collected by Abt Associates for an evaluation of Reading First and used to estimate external validity bias in Bell et al. (2012). The data include student-level longitudinal data for all school districts in 9 states. We have also collected lists of the school districts that participated in 11 IES-funded impact evaluations, each of which selected a purposive sample. These data allow us to estimate impacts for all districts in the 9 states and for the subset of these districts that were included in one or more of the purposive samples. In this work, we use the statistical methods listed above to adjust the impact estimates based on the purposively selected sites and assess how much closer the resulting impact estimates are to the true population impacts.