Panel Paper: Addressing External Validity in Within-Study Comparisons

Thursday, November 7, 2019
Plaza Building: Concourse Level, Governor's Square 10 (Sheraton Denver Downtown)

Mark White, University of Michigan


Within-study comparisons (WSCs) have traditionally compared causal estimates from randomly assigned treatment and control groups to causal estimates from non-experimental approaches applied to the same experimental treatment group. This focuses attention on internal validity and on the important question of whether non-experimental estimates can account for selection into treatment and eliminate internal validity bias (as demonstrated by their ability to match experimental effect estimates). I argue that this is an unfair evaluation of the usefulness of non-experimental approaches, because such approaches often gain external validity at the cost of internal validity, a gain that typical WSC designs ignore. This is especially true in education, where non-experimental estimates are often generated by comparative interrupted time series (CITS) analyses of samples much larger than are typical in experiments and arguably more generalizable to important inference populations. If WSC designs are to properly evaluate the usefulness of non-experimental approaches relative to experimental approaches, they must be expanded to account for issues of external validity.

This study proposes such an expansion of WSC designs. Namely, I propose adding a second non-experimental analysis that studies the full inference population. For example, in an evaluation of a school improvement program, the second non-experimental study would analyze all schools using the program with a CITS design (defining these schools as the inference population). This yields three causal estimates: the experimental estimate, the non-experimental estimate on the experimental sample, and the non-experimental estimate on the inference population. The first two estimates potentially have external validity bias, while the last two potentially have internal validity bias. Under a set of assumptions, which I detail, these three estimates provide both an estimate of the internal validity bias due to non-random assignment to treatment in the non-experimental approaches (the difference between the non-experimental estimate on the experimental sample and the experimental estimate) and an estimate of the external validity bias due to non-random sampling of the experimental sample from the inference population (the difference between the two non-experimental estimates). Comparing the internal and external validity biases enables an evaluation of whether experimental or non-experimental approaches provide a more accurate estimate of program effectiveness in the inference population. I demonstrate this approach using data from a recent experimental evaluation of a beginning literacy program called Burst©:Reading, showing that the external validity bias is larger than the internal validity bias (at least in the point estimates), suggesting that the non-experimental approach is more accurate in this case.
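To make the two bias estimates concrete, here is a minimal notational sketch; the symbols are mine, as the abstract does not fix notation. Let \(\hat{\tau}^{E}_{S}\) be the experimental estimate on the experimental sample \(S\), \(\hat{\tau}^{NE}_{S}\) the non-experimental estimate on \(S\), and \(\hat{\tau}^{NE}_{P}\) the non-experimental estimate on the inference population \(P\). The two differences described above are then

\[
\widehat{B}_{\mathrm{int}} = \hat{\tau}^{NE}_{S} - \hat{\tau}^{E}_{S},
\qquad
\widehat{B}_{\mathrm{ext}} = \hat{\tau}^{NE}_{P} - \hat{\tau}^{NE}_{S}.
\]

Under the stated assumptions, comparing \(|\widehat{B}_{\mathrm{ext}}|\) to \(|\widehat{B}_{\mathrm{int}}|\) indicates which estimate, experimental or non-experimental, lies closer to the program's effect in the inference population.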

This expansion of WSC designs allows for a more authentic comparison of experimental and non-experimental approaches because it pits the strength of experimental approaches (i.e., high internal validity) against the strength of many non-experimental approaches (i.e., high external validity). As such, it more comprehensively portrays the trade-offs among different approaches to establishing evidence of program effectiveness. I discuss ways of further expanding this approach to incorporate multiple inference populations.