Panel Paper:
Methods for Assessing Correspondence Between Non-Experimental and Benchmark Results in Within-Study Comparison Designs: Results from an Evaluation of Repeated Measures Approaches
In within-study comparison (WSC) designs, correspondence between benchmark and non-experimental (NE) effects may be assessed in a number of ways. To examine the policy question of whether the experiment and the NE produce comparable results in field settings, correspondence may be assessed by comparing the direction and magnitude of effects, as well as the patterns of statistical significance, across the experiment and the non-experiment. To address the methodological question of whether the NE produces unbiased results in field settings, researchers may turn to direct measures of bias: the difference between NE and experimental effect estimates, the percent of bias reduced relative to the initial naïve comparison, and the effect size difference between experimental and NE results. Because of sampling error, however, even close replications of the same randomized experiment would not be expected to yield identical posttest sample means and variances. Another common approach for assessing correspondence is therefore to test the statistical significance of the difference between NE and experimental estimates, using bootstrapped standard errors to account for covariance between the experimental and non-experimental data where appropriate. Careful consideration of the standard null hypothesis significance testing (NHST) framework in the WSC context, however, reveals serious weaknesses in this approach. This paper proposes statistical tests of equivalence for assessing correspondence in WSC results. Tests of equivalence are useful when a researcher wishes to assess whether a new or alternative approach (such as a non-experiment) performs as well as the gold-standard experimental approach.
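As a rough illustration of the equivalence-testing idea (a minimal sketch, not the paper's actual procedure), the snippet below runs a two one-sided tests (TOST) comparison of a non-experimental and an experimental effect estimate. The function name, the equivalence margin, and the numeric inputs are hypothetical, and the covariance argument stands in for a bootstrap estimate of the dependence between the two estimates.

import numpy as np
from scipy import stats

def tost_equivalence(b_exp, se_exp, b_ne, se_ne, margin, cov=0.0):
    """TOST for equivalence of an NE and an experimental effect estimate.
    H0: |b_ne - b_exp| >= margin  vs.  H1: |b_ne - b_exp| < margin.
    `cov` allows for covariance between the two estimates (e.g., when the
    treatment group is shared), such as a bootstrap estimate over the WSC data.
    """
    diff = b_ne - b_exp
    se_diff = np.sqrt(se_exp**2 + se_ne**2 - 2 * cov)
    # One-sided test of H0a: diff <= -margin (NE meaningfully smaller).
    p_lower = 1.0 - stats.norm.cdf((diff + margin) / se_diff)
    # One-sided test of H0b: diff >= +margin (NE meaningfully larger).
    p_upper = stats.norm.cdf((diff - margin) / se_diff)
    return diff, se_diff, max(p_lower, p_upper)

# Hypothetical numbers purely for illustration.
diff, se_diff, p = tost_equivalence(b_exp=0.15, se_exp=0.04,
                                    b_ne=0.12, se_ne=0.05, margin=0.10)
print(f"difference = {diff:.3f} (SE {se_diff:.3f}), TOST p = {p:.3f}")

Equivalence is concluded only when both one-sided tests reject, i.e., when the larger of the two p-values falls below the chosen significance level.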
To demonstrate these methods for assessing correspondence in WSC results, the study uses a WSC design that examines the performance of interrupted time series approaches. The WSC was constructed from experimental data from the Cash and Counseling Demonstration Project, which evaluated the effects of a “consumer-directed” care program on Medicaid recipients’ outcomes. The data include monthly Medicaid measures for the 12 months before and after random assignment. Intent-to-treat (ITT) and treatment-on-the-treated (TOT) effects from the experimental benchmark are compared with those obtained from a comparative interrupted time series (CITS) design in which comparison units were matched from two other states. The paper highlights advantages and disadvantages of the various methods for assessing correspondence in WSC designs and provides guidelines for establishing criteria to determine whether non-experimental methods succeed in replicating benchmark results.
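For readers unfamiliar with the design, the following sketch shows one generic way a CITS specification of this kind might be estimated; the file name, column names, and clustering variable are assumptions for illustration only and do not reflect the study's actual data or estimation code.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format panel, one row per person-month:
#   y         : monthly Medicaid outcome
#   treat     : 1 for the demonstration group, 0 for matched comparison units
#   post      : 1 for months after random assignment
#   month     : time index centered at the assignment month
#   person_id : individual identifier used for clustered standard errors
df = pd.read_csv("wsc_panel.csv")

# A basic CITS specification: group differences in pre-period level and slope,
# plus treatment-group shifts in level (treat:post) and slope (treat:post:month)
# after random assignment, which serve as the NE effect estimates.
model = smf.ols("y ~ treat * post * month", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["person_id"]}
)
print(model.summary())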