Panel Paper: Replication Designs for Causal Inference

Friday, November 9, 2018
Marriott Balcony A - Mezz Level (Marriott Wardman Park)


Vivian C. Wong, University of Virginia, and Peter Steiner, University of Wisconsin–Madison


Considerable attention has been devoted to examining the prevalence and success of reproducibility efforts. Despite consensus on promoting the reproducibility of scientific findings, there is substantial disagreement about how replication results should be interpreted. There are three reasons why study results may fail to reproduce: random error, low statistical power, and bias (Gilbert, King, Pettigrew, et al., 2016). Recent methodological work on reproducibility has focused on statistical issues related to error and power for detecting the comparability of results (Benjamin, Berger, Johannesson, et al., 2017; Simonsohn, 2015). Bias, however, remains a serious challenge. Bias refers to differences between the original and replication studies that are related to the studies' outcomes. In replication contexts, potential sources of bias are numerous and broadly defined: researchers may manipulate or selectively present findings, the research protocol may deviate across studies, treatment and control conditions may vary, and/or the method of analysis may change. Contextual changes that occur between the original and replication studies may also affect the reproducibility of results. Combined, these concerns suggest that bias may be the key methodological challenge for replication, especially in fields where there is high uncertainty in outcomes and limited experimenter control.
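The three sources of non-replication can be made concrete with a simulation. The sketch below is ours, not the authors'; the function names (study, replication_rate) and all parameter values are illustrative assumptions. It conditions on a significant original result, then varies only the replication's sample size (low power) or its estimand (bias) to show how each mechanism depresses the replication rate even when nothing is "wrong" with either study's execution.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def study(n, effect, bias=0.0):
    """One two-arm study: returns (effect estimate, two-sided p-value)."""
    y0 = rng.normal(0.0, 1.0, n)            # control outcomes
    y1 = rng.normal(effect + bias, 1.0, n)  # treated outcomes; `bias` shifts the estimand
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / n + y0.var(ddof=1) / n)
    p = 2 * norm.sf(abs(est / se))
    return est, p

def replication_rate(n_orig, n_rep, effect, rep_bias=0.0, sims=5000):
    """Share of significant originals whose replication is significant and same-signed."""
    hits = total = 0
    for _ in range(sims):
        est_o, p_o = study(n_orig, effect)
        if p_o >= 0.05:
            continue  # condition on a 'successful' original, as replication projects do
        total += 1
        est_r, p_r = study(n_rep, effect, bias=rep_bias)
        hits += (p_r < 0.05) and (np.sign(est_r) == np.sign(est_o))
    return hits / total

print(replication_rate(200, 200, 0.3))                 # random error only: high but imperfect rate
print(replication_rate(200, 50, 0.3))                  # low power in the replication
print(replication_rate(200, 200, 0.3, rep_bias=-0.2))  # bias: the replication targets a shifted estimand
```

Note that the biased scenario fails to replicate at a high rate even though both studies are internally sound; this is the sense in which bias, unlike random error or power, cannot be fixed by larger samples.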

This paper provides a formal understanding of replication as a research design through the potential outcomes framework (Rubin, 1974). We describe five assumptions needed for two studies (an original and a replication study) to produce identical results within the limits of sampling error. First, outcomes and treatment conditions must be stable across study arms; this implies that treatment conditions are well defined and that there are no peer or contamination effects across study arms. Second, the causal estimand being compared must be the same in both studies. Third, the causal estimand for the replication population must be well identified in both studies; this may be achieved through an RCT or a well-implemented quasi-experiment. Fourth, effects must be estimated with an unbiased estimator, or a consistent estimator in sufficiently large samples. Fifth, results must be reported without error once effects have been identified and estimated.
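To make the comparison concrete, the following is a minimal sketch in potential outcomes notation; the notation is ours, added for illustration in the spirit of Rubin (1974), and is not taken from the paper itself.

```latex
% Sketch of the replication comparison in potential-outcomes notation.
% The notation is illustrative, not the paper's own.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Let $Y_i(1)$ and $Y_i(0)$ denote unit $i$'s potential outcomes under
treatment and control. Study $s \in \{o, r\}$ (original, replication)
targets the average treatment effect for its population $P_s$:
\[
  \tau_s = \mathbb{E}_{P_s}\!\left[\, Y_i(1) - Y_i(0) \,\right].
\]
Under the five assumptions (stable potential outcomes and treatments,
identical estimands $\tau_o = \tau_r$, identification in both studies,
unbiased or consistent estimation, and error-free reporting), the
estimates satisfy
\[
  \hat{\tau}_o - \hat{\tau}_r \;\xrightarrow{p}\; 0,
\]
so any remaining difference is attributable to sampling error alone.
\end{document}
```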

The assumptions for replication in field settings are stringent and often infeasible. However, by employing careful research design elements and empirical diagnostics for ruling out plausible biases, replication design assumptions may be met. To this end, the paper highlights three design variants of replication: prospective, within-study, and matched replication approaches. In the first two designs, the researcher introduces systematic variation between the original and replication studies, which provides greater confidence that replication design assumptions are met. In matched replication designs, by contrast, differences between study arms occur naturally and are not researcher-controlled. The paper highlights the relative strengths and weaknesses of each replication design and provides examples of each approach.