Panel Paper: A Meta-Analysis of Within-Study Comparisons

Thursday, November 7, 2019
Plaza Building: Concourse Level, Governor's Square 10 (Sheraton Denver Downtown)

*Names in bold indicate Presenter

Kylie L. Anglin (1), Vivian Wong (1) and Peter Steiner (2); (1) University of Virginia, (2) University of Wisconsin–Madison


Given the widespread use of non-experimental approaches for assessing the causal impact of interventions, there is a strong need to identify which non-experimental methods can produce credible impact estimates in field settings. Over the last three decades, the within-study comparison (WSC) design has emerged as a method for empirically evaluating non-experimental designs in field settings to address this need. In the traditional WSC design, treatment effects from a randomized experiment (RCT) are compared to those produced by a non-experimental approach that shares at least the same target population and intervention. The goals of the WSC are to determine whether the non-experiment can replicate results from a high-quality RCT (which provides the causal benchmark estimate), and to identify the contexts and conditions under which these methods do or do not work in practice.
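
In notation introduced here for exposition (it does not appear in the conference listing), the quantity a WSC recovers is the bias of the non-experimental estimator, defined as the difference between the non-experimental and benchmark experimental impact estimates:

$$\hat{B} = \hat{\tau}_{\text{NE}} - \hat{\tau}_{\text{RCT}},$$

where $\hat{\tau}_{\text{NE}}$ is the treatment effect from the non-experimental design and $\hat{\tau}_{\text{RCT}}$ is the corresponding benchmark estimate from the randomized experiment; a well-performing non-experimental method yields $\hat{B}$ close to zero.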

A common critique of individual WSC evaluations concerns their generalizability. Results from a single WSC say little about how a method performs in general, but results from multiple WSCs may provide insight into how well these methods perform for similar outcomes and settings of interest. To that end, this paper presents the initial findings of a meta-analysis of all published and unpublished WSCs conducted since 1986. These studies vary substantially in context, in the non-experimental methods examined, and in their outcomes and treatment selection mechanisms.

The goal of the meta-analysis is to produce a summary estimate of the difference between experimental and non-experimental results for a given topic area and method, and to explain that difference through variation in WSC characteristics. Because the majority of WSCs report multiple bias estimates (e.g., from multiple outcomes or different matching estimators), we analyze the extent of bias using a hierarchical linear model. The multi-level analysis allows us to assess whether non-experimental methods replicate experimental benchmark estimates on average, and whether bias estimates are homogeneous across studies. In particular, we analyze the relationship between non-experimental bias and five variables of interest: 1) the field of study; 2) the outcome of interest; 3) the quality of the benchmark RCT; 4) the non-experimental method; and 5) the quality of the non-experimental design. In addition, we address unique methodological challenges in meta-analyzing WSC results that include observational studies with both positive and negative selection processes.
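
As a sketch of the kind of two-level specification this implies (the notation is ours, introduced for exposition rather than taken from the paper), let $\hat{B}_{ij}$ denote bias estimate $j$ within WSC study $i$:

$$\hat{B}_{ij} = \theta_i + e_{ij}, \qquad \theta_i = \gamma_0 + \mathbf{x}_i'\boldsymbol{\gamma} + u_i, \qquad u_i \sim N(0, \tau^2),$$

where $e_{ij}$ is the sampling error of the individual bias estimate, $u_i$ captures between-study heterogeneity, and $\mathbf{x}_i$ collects study-level moderators such as the five characteristics listed above (in practice, moderators such as the outcome of interest may also enter at the estimate level). Under this specification, $\gamma_0$ is the average bias for the reference category, the coefficients $\boldsymbol{\gamma}$ describe how bias varies with WSC characteristics, and $\tau^2 = 0$ corresponds to bias that is homogeneous across studies.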

Though we have not yet coded the entire corpus of WSCs (coding is on schedule to be completed by August 2019), initial results confirm the findings of qualitative syntheses and the traditionally understood hierarchy of designs: regression discontinuity designs outperform comparative interrupted time series designs, which in turn outperform matching designs. We are not yet powered to analyze differences in bias by field of study or outcome of interest, but initial findings suggest that even the bias of matching designs is minimal when pre-test measures are included as covariates. Once coding is complete in August 2019, our final results will include point estimates of the average standardized bias across all WSCs and by key characteristics.