Panel Paper:
Assessing Correspondence in (Design)-Replication Studies
*Names in bold indicate Presenter
This paper addresses methodological issues in assessing correspondence in replication studies. We begin by formalizing the replication study as a research design in its own right. The paper identifies five stringent assumptions needed for the direct reproducibility of study results, as well as design requirements for examining replication across different investigators, settings, treatments, units, and methods. The paper then examines the statistical properties of common metrics for assessing correspondence in replication studies. Steiner and Wong (2016) distinguish between two classes of measures for assessing correspondence in results. The first is a distance-based (or difference-based) measure, which estimates the difference between the original (or benchmark) result and the replicated effect. The second class of metrics, which Steiner and Wong (2016) call “conclusion-based measures,” is the most popular approach for assessing correspondence in results. These measures assess the comparability of results by comparing the size, direction, and statistical significance patterns of the two studies’ estimates.
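The contrast between the two classes of measures can be illustrated with a minimal sketch. The numbers below are hypothetical, not from the paper, and the sketch assumes simple two-sided z-tests at the 5% level for both the individual studies and the difference between independent estimates:

```python
import math

# Hypothetical results: point estimate and standard error for an
# original study and its replication (illustrative numbers only).
orig_est, orig_se = 0.40, 0.12
rep_est, rep_se = 0.25, 0.15

# Distance-based measure: test whether the difference between the two
# estimates is distinguishable from zero, assuming independent studies.
diff = orig_est - rep_est
se_diff = math.sqrt(orig_se**2 + rep_se**2)
z_diff = diff / se_diff  # |z_diff| < 1.96 -> no significant difference at 5%

# Conclusion-based measure: do both studies reach the same qualitative
# conclusion (same sign and same significance verdict at the 5% level)?
same_sign = (orig_est > 0) == (rep_est > 0)
orig_sig = abs(orig_est / orig_se) > 1.96
rep_sig = abs(rep_est / rep_se) > 1.96
same_conclusion = same_sign and (orig_sig == rep_sig)
```

With these illustrative numbers the two classes of measures disagree: the original estimate is significant while the replication is not, so the conclusion-based verdict is "failed to replicate," yet the distance-based test finds no significant difference between the two estimates.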
Results from our simulation study show that conclusion-based measures are highly sensitive to the statistical power of both studies (i.e., to the magnitude of the unknown true effect, the sample sizes, and the error variances) and to the direction of any bias present (bias is particularly likely when a non-experimental estimate is compared with a randomized experiment). These results suggest that researchers should interpret conclusion-based measures cautiously and within the context of their study conditions. The results also highlight misunderstandings in the “replication crisis” debate in science. Without knowing the true effect of an intervention, or the studies’ actual power to detect it, it is nearly impossible to assess the degree of correspondence one can reasonably expect from replicating a study. Moreover, assessing correspondence with conclusion-based measures may be misleading if one of the studies is contaminated with bias. Therefore, in addition to conclusion-based measures, researchers should consider distance-based alternatives, such as significance and equivalence tests of the estimated difference between the two results.
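The power sensitivity of conclusion-based measures can be reproduced in a few lines. The sketch below is not the paper’s simulation design; it assumes a one-sample z-test at the 5% level on normally distributed data and simply counts how often two identically designed studies reach the same significance verdict:

```python
import random
import statistics

def agreement_rate(true_effect, n, sims=2000, seed=0):
    """Fraction of independent study pairs reaching the same
    significance verdict (illustrative sketch only).

    Each study draws n observations from N(true_effect, 1); a study
    is 'significant' if |mean| / (sd / sqrt(n)) > 1.96.
    """
    rng = random.Random(seed)
    agree = 0
    for _ in range(sims):
        verdicts = []
        for _study in range(2):
            xs = [rng.gauss(true_effect, 1.0) for _ in range(n)]
            m = statistics.fmean(xs)
            se = statistics.stdev(xs) / n**0.5
            verdicts.append(abs(m / se) > 1.96)
        agree += verdicts[0] == verdicts[1]
    return agree / sims
```

With a small true effect, agreement is low at moderate sample sizes (both studies are underpowered, so verdicts flip unpredictably) and rises sharply once both studies are well powered, even though the underlying effect never changes.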