Extensions of Within-Study Comparison Approaches to Investigate the Generalizability of Causal Inferences Across Study Sites

Jaciw, Andrew

Within-Study Comparison (WSC) methods have been used extensively to assess the internal validity of inferences from study designs that utilize a non-experimental counterfactual to cases randomly assigned to treatment. The question is: how well can we replicate the experimental result if we lack an experimental control and must instead use a non-experimental comparison group? A variant of this approach attempts to replicate the experimental estimate for a given site in a multisite trial, where comparison cases are selected from among control cases at other sites in the study. Here the question concerns the magnitude of bias arising from selection of individuals into sites, and the effectiveness of methods to reduce it.

The current work extends the multisite variant of WSC to investigate bias from using a non-experimental counterfactual to cases randomly assigned to control. It asks: how accurately can we infer how the control group at a given site would have performed, had they been assigned to treatment, using outcomes from individuals randomly assigned to treatment at the other sites? The availability of an experimental estimate at the target site allows us to gauge the accuracy of the non-experimental approach. This application of WSC assesses how accurately outcomes from other sites generalize to the impact at a given site.
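The comparison described above can be illustrated with a minimal sketch. It assumes site-level mean outcomes are already available and that the other sites are pooled with equal weight; the function name, the equal weighting, and the numbers are hypothetical and are not taken from the study.

```python
from statistics import mean

def leave_one_site_out_bias(treated_means, control_means):
    """For each site, compare a non-experimental impact estimate
    (treated cases at the OTHER sites vs. this site's controls)
    with the experimental estimate at this site.
    Assumption: other sites are averaged with equal weight."""
    biases = []
    for s in range(len(treated_means)):
        # Treated-group means from every site except the target site
        others = [t for i, t in enumerate(treated_means) if i != s]
        # Experimental benchmark: randomized contrast at the target site
        experimental = treated_means[s] - control_means[s]
        # Non-experimental estimate: borrowed treated outcome vs. local controls
        non_experimental = mean(others) - control_means[s]
        biases.append(non_experimental - experimental)
    return biases

# Hypothetical site-level means in standard deviation units
treated = [0.50, 0.30, 0.70]
control = [0.20, 0.10, 0.30]
biases = leave_one_site_out_bias(treated, control)
```

Under these assumptions the bias at a site reduces to the gap between the borrowed treated-group mean and the site's own treated-group mean, which is why the experimental estimate at the target site can serve as the benchmark.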

The current work has three main foci. The first is the development of the WSC framework to address questions of external validity. We show that average absolute bias, from using a non-experimental comparison to infer counterfactual performance for controls at a site, can be decomposed into three terms: the difference between the sites being compared in average performance of the controls (Bias 1, which reflects effects of confounders and is the quantity of interest in traditional WSC studies), the difference between them in average program impact (Bias 2, which reflects imbalance on moderators of impact), and the covariance between these two biases. Second, we develop an approach to summarizing bias when each site of a multisite trial yields an estimate of bias. We propose the square root of average squared bias as an alternative to the often-used average of the absolute value of bias; it allows the three quantities described above to be summarized separately. Third, we apply the methodology to results from two multisite trials in education: the Tennessee STAR Class Size Reduction Experiment and a randomized trial of the Alabama Math Science and Technology Initiative (AMSTI). We found Bias 1 to be larger than Bias 2. For example, in the AMSTI study, Bias 1 and Bias 2 were .42 and .10 standard deviation units, respectively, before covariate adjustment, and .07 and .08 after adjusting for main and moderating effects of covariates. In addition, a negative covariance between the biases led to a total average absolute bias of .04 standard deviations with covariate adjustments.
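To illustrate why a root-mean-square summary lends itself to separate reporting of the components, the sketch below assumes that total bias at each site is the sum of Bias 1 and Bias 2, so that average squared bias splits into two squared terms plus a cross-product term. The cross-product stands in for the covariance term and equals it only when the biases average to zero across sites. The function name and the numbers are hypothetical and do not reproduce the paper's estimator or results.

```python
from math import sqrt
from statistics import mean

def summarize_bias(bias1, bias2):
    """Summarize site-level biases by mean squares.
    Assumption: total bias at a site = Bias 1 + Bias 2."""
    total = [b1 + b2 for b1, b2 in zip(bias1, bias2)]
    ms_total = mean([t * t for t in total])
    ms_bias1 = mean([b * b for b in bias1])
    ms_bias2 = mean([b * b for b in bias2])
    # Cross-product term; matches a covariance only for zero-mean biases
    cross = 2 * mean([b1 * b2 for b1, b2 in zip(bias1, bias2)])
    return {
        "rms_total": sqrt(ms_total),
        "rms_bias1": sqrt(ms_bias1),
        "rms_bias2": sqrt(ms_bias2),
        "ms_total": ms_total,
        "ms_bias1": ms_bias1,
        "ms_bias2": ms_bias2,
        "cross": cross,
    }

# Hypothetical site-level biases in standard deviation units
summary = summarize_bias([0.4, -0.2, 0.1], [-0.1, 0.1, -0.2])
```

In this made-up example the cross-product term is negative, so the root-mean-square total bias is smaller than the root-mean-square of Bias 1 alone, the same qualitative pattern as the offsetting biases reported for AMSTI. An average of absolute values does not split into additive components in this way.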

We conclude by discussing implications for assessing accuracy of different causal quantities (i.e., ITT versus TOT), and how lessons learned from decades of WSC research may be applied to the methods investigated in the current work.

Association for Public Policy Analysis & Management

Panel Paper: Extensions of Within-Study Comparison Approaches to Investigate the Generalizability of Causal Inferences Across Study Sites