Panel Paper: Preliminary Results from a Six-Arm Within-Study Comparison

Thursday, November 7, 2019
Plaza Building: Concourse Level, Governor's Square 10 (Sheraton Denver Downtown)


Bryan Keller, Columbia University


Within-study comparisons (WSCs) are broadly defined as studies that permit, by design, comparisons of effect size estimates from observational studies to estimates based on randomized experiments for the same intervention. Prior to 2008, within-study comparisons almost exclusively used a “three-arm” design, wherein the control group from a randomized experiment is replaced by a non-randomly selected comparison group. Shadish, Clark, & Steiner (2008) proposed and implemented a “four-arm” design in which participants were randomly assigned to be in either a randomized experiment or a quasi-experiment.


The four-arm design handles, through randomization, potential confounds that could invalidate comparisons based on three-arm designs; however, the four-arm design does not allow for estimation of conditional average treatment effects (CATEs) such as the average treatment effect on the treated (ATT), which are often of more interest in practice than the overall ATE. In this presentation I will report on the design, analysis, and results of a six-arm WSC proposed by Shadish & Steiner (2008) that incorporates an additional level of random assignment that permits experimental and quasi-experimental identification of the ATE, ATT, and average treatment effect on the controls (ATC).
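
For reference, these three estimands can be written in standard potential-outcomes notation; the notation below is added here for clarity and is not drawn from the study materials. With Y(1) and Y(0) the potential outcomes under the two trainings and D an indicator for selection into treatment:

```latex
\begin{align*}
\text{ATE} &= \mathbb{E}[\,Y(1) - Y(0)\,] \\
\text{ATT} &= \mathbb{E}[\,Y(1) - Y(0) \mid D = 1\,] \\
\text{ATC} &= \mathbb{E}[\,Y(1) - Y(0) \mid D = 0\,]
\end{align*}
```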


The experimental arm of the six-arm design is identical to the experimental arms of the three- and four-arm replication studies. On the quasi-experimental side, however, participants are asked to select a training, either mathematics or vocabulary. Then, regardless of which training was selected, an additional randomization determines the training actually received. The purpose of eliciting a selection but then randomizing assignment despite it is to sort participants into groups based on the training they would have selected. This is the key innovation that enables estimation of conditional average treatment effects for the treated and untreated groups.
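
As a rough illustration of the assignment structure described above, the following sketch simulates how a participant could be routed into one of the six arms. The 50/50 split, the group labels, and the function name are hypothetical choices for illustration and are not taken from the study protocol.

```python
import random

def assign_participant(selected_training=None):
    """Illustrative routing of one participant into one of the six arms.

    `selected_training` is the training ('math' or 'vocab') the participant
    says they would choose; it is consulted only on the quasi-experimental
    side, where it is recorded and then overridden by a second randomization.
    """
    # First randomization: experimental vs. quasi-experimental side.
    if random.random() < 0.5:
        # Experimental side: training assigned purely at random (2 arms).
        received = random.choice(["math", "vocab"])
        return {"side": "experimental", "selected": None, "received": received}
    # Quasi-experimental side: record the stated selection, then randomize
    # the training actually received regardless of that selection
    # (2 selections x 2 assigned trainings = 4 arms).
    received = random.choice(["math", "vocab"])
    return {"side": "quasi-experimental",
            "selected": selected_training,
            "received": received}

# Example: a participant who reports they would have chosen vocabulary training.
print(assign_participant(selected_training="vocab"))
```

Under this structure, the recorded selection plays the role of the treatment indicator D in the estimands above, while the second randomization provides an experimental benchmark within each selection group.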


Study participants will be recruited online through Amazon’s Mechanical Turk (MTurk) marketplace. In line with Shadish, Clark, & Steiner (2008), I will collect baseline measures of demographic information, math and vocabulary aptitude, social and emotional health, and preferences related to mathematics and vocabulary. I expect to have data collected by late May and analyzed by July.


To the best of my knowledge, this study represents the first WSC that allows for comparisons of experimental and quasi-experimental estimates of conditional ATEs. Rapid developments in the theory and application of causal effect estimation have led to many open questions about real-world finite-sample performance. The six-arm WSC design is novel because it will provide immediate opportunities to test the ability of conditional and instrumental variable (IV) causal effect estimation approaches to replicate randomized benchmarks for both overall and conditional ATEs.


Estimates from the randomized benchmarks and the corresponding quasi-experiments will be presented and compared. Covariate balance, propensity score overlap, and graphical diagnostics will be presented for the quasi-experimental comparisons. Concordance between the quasi-experimental estimates and the experimental benchmarks will be measured as the proportion of bias reduction and with statistical tests of equivalence. Practical considerations for the implementation of six-arm design replication studies will also be discussed.
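
For concreteness, one common way to operationalize the proportion of bias reduction in WSC analyses is sketched below. The function name, variable names, and the particular formula (comparing adjusted and unadjusted quasi-experimental estimates against the experimental benchmark) are assumptions for illustration and not necessarily the exact metric used in this study.

```python
def proportion_bias_reduction(benchmark, quasi_unadjusted, quasi_adjusted):
    """Share of the naive (unadjusted) bias removed by covariate adjustment.

    benchmark:         effect estimate from the randomized experiment
    quasi_unadjusted:  quasi-experimental estimate before adjustment
    quasi_adjusted:    quasi-experimental estimate after adjustment
    Returns a value near 1.0 when adjustment recovers the benchmark and
    near 0.0 (or negative) when it removes little bias or adds bias.
    """
    bias_before = abs(quasi_unadjusted - benchmark)
    bias_after = abs(quasi_adjusted - benchmark)
    # Undefined when the unadjusted estimate already matches the benchmark.
    return 1.0 - bias_after / bias_before

# Hypothetical numbers for illustration only.
print(proportion_bias_reduction(benchmark=0.40,
                                quasi_unadjusted=0.70,
                                quasi_adjusted=0.45))  # -> approximately 0.83
```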