Panel Paper: Assessing Correspondence in (Design)-Replication Studies

Thursday, November 2, 2017
McCormick (Hyatt Regency Chicago)

Vivian C Wong, University of Virginia and Peter Steiner, University of Wisconsin - Madison


Reproducibility is a hallmark of science. Rather than relying on causal conclusions from a single experimental or non-experimental study, scientific knowledge is best built through careful replication of studies or through meta-analysis of results from multiple studies. Prior efforts to replicate study results have yielded disappointing rates of reproducibility across multiple metrics of correspondence (Novak et al., 2012). Critics, however, have challenged the interpretation of these replication efforts (Gilbert et al., 2015), suggesting alternative explanations for why reproducibility was not achieved. Recent debates about the “replication crisis” have raised questions about the design features essential to a replication study, and about whether a pure replication study is feasible at all. The same challenges arise when assessing “correspondence” in the results of design-replication studies, also called within-study comparisons (WSCs), in which the analyst must evaluate whether a non-experimental method reproduces the causal effect from a corresponding experimental benchmark study.

This paper addresses methodological issues in assessing correspondence in replication studies. We begin by formalizing the replication study as a research design in its own right. The paper identifies five stringent assumptions needed for the direct reproducibility of study results, as well as the design requirements for examining replication across different investigators, settings, treatments, units, and methods. The paper then examines the statistical properties of common metrics for assessing correspondence in replication studies. Steiner and Wong (2016) distinguish between two classes of measures for assessing correspondence in results. The first is the distance-based (or difference-based) measure, which estimates the difference between the original (or benchmark) result and the replicated effect. The second class, which Steiner and Wong (2016) call “conclusion-based measures,” is the most popular approach for assessing correspondence in results; these measures judge comparability by comparing the size, direction, and statistical-significance patterns of the two sets of results.
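
To make the two classes of measures concrete, the sketch below (illustrative only, not code from the paper) contrasts them for a single pair of estimates: an experimental benchmark and a non-experimental replication. The estimates, standard errors, and the equivalence margin `delta` are hypothetical placeholders.

```python
# Illustrative sketch only (not the authors' code): contrasts a distance-based
# correspondence measure with a simple conclusion-based one, given an estimate
# from an experimental benchmark and one from a non-experimental replication.
# All inputs (estimates, standard errors, equivalence margin `delta`) are hypothetical.
import numpy as np
from scipy import stats

def distance_based(b_est, b_se, r_est, r_se, delta=0.10, alpha=0.05):
    """Difference test and TOST equivalence test on (replication - benchmark)."""
    diff = r_est - b_est
    se_diff = np.sqrt(b_se**2 + r_se**2)                    # assumes independent samples
    p_diff = 2 * (1 - stats.norm.cdf(abs(diff / se_diff)))  # H0: difference = 0
    # Two one-sided tests (TOST); equivalence if both reject |difference| >= delta
    p_lower = 1 - stats.norm.cdf((diff + delta) / se_diff)
    p_upper = stats.norm.cdf((diff - delta) / se_diff)
    return {"difference": diff,
            "difference_significant": p_diff < alpha,
            "equivalent_within_delta": max(p_lower, p_upper) < alpha}

def conclusion_based(b_est, b_se, r_est, r_se, alpha=0.05):
    """Do the two studies agree on sign and on statistical significance?"""
    def significant(est, se):
        return 2 * (1 - stats.norm.cdf(abs(est / se))) < alpha
    return {"same_sign": np.sign(b_est) == np.sign(r_est),
            "same_significance_conclusion":
                significant(b_est, b_se) == significant(r_est, r_se)}

# Hypothetical estimates: RCT benchmark vs. non-experimental replication
print(distance_based(b_est=0.25, b_se=0.08, r_est=0.18, r_se=0.10))
print(conclusion_based(b_est=0.25, b_se=0.08, r_est=0.18, r_se=0.10))
```

In this hypothetical case the two estimates are not significantly different from each other (though not demonstrably equivalent either), yet a conclusion-based reading would flag them as discordant because only the benchmark estimate is statistically significant.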

Results from our simulation study show that conclusion-based measures are highly sensitive to the statistical power of both studies (i.e., to the magnitude of the unknown true effect, the sample sizes, and the error variances) and to the direction of any bias present (bias is particularly likely when a non-experimental estimate is compared with a randomized benchmark). These results suggest that researchers should interpret conclusion-based measures cautiously and within the context of their study conditions. The results also highlight some misunderstandings in the “replication crisis” debate in science. Without knowing the underlying true effect of an intervention, or the studies’ actual power to detect it, it is nearly impossible to assess the degree of correspondence one can reasonably expect from replicating a study. Moreover, assessing correspondence with conclusion-based measures may be misleading if one of the studies is contaminated with bias. Therefore, in addition to conclusion-based measures, researchers should consider distance-based alternatives, such as significance and equivalence tests of the estimated difference between the two results.
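
A minimal simulation sketch, under assumed parameter values rather than the settings used in the paper, illustrates the power sensitivity described above: even when the benchmark and the replication estimate the same unbiased true effect, the rate at which they reach the same significance conclusion depends heavily on sample size.

```python
# Minimal simulation sketch with assumed (not the paper's) settings: both studies
# estimate the same true effect without bias, yet agreement on the significance
# conclusion varies with statistical power (here driven by the per-arm sample size).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2017)
TRUE_EFFECT, SD, ALPHA, REPS = 0.20, 1.0, 0.05, 5000

def study_is_significant(n):
    """Simulate one two-arm study with n units per arm; test the mean difference."""
    treated = rng.normal(TRUE_EFFECT, SD, n)
    control = rng.normal(0.0, SD, n)
    return stats.ttest_ind(treated, control).pvalue < ALPHA

for n in (50, 200, 800):
    matches = []
    for _ in range(REPS):
        benchmark_significant = study_is_significant(n)    # "original" study
        replication_significant = study_is_significant(n)  # independent replication
        matches.append(benchmark_significant == replication_significant)
    print(f"n per arm = {n:4d}: agreement rate on the significance conclusion = "
          f"{np.mean(matches):.2f}")
```

With these assumed values, agreement is high when both studies are well powered, but falls to roughly chance levels at intermediate power, even though neither study is biased.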