Panel: [DATA] New Developments in the Empirical Evaluation of Non-Experimental Methods
(Tools of Analysis: Methods, Data, Informatics and Empirical Design)

Friday, November 7, 2014: 10:15 AM-11:45 AM
Apache (Convention Center)

*Names in bold indicate Presenter

Panel Organizer:  Vivian C. Wong, University of Virginia
Panel Chair:  Phil Gleason, Mathematica Policy Research
Discussant:  Thomas Cook, Northwestern University


Theoretical Foundations in the Design, Implementation, and Analysis of Within-Study Comparisons for Evaluating Quasi-Experimental Approaches
Vivian C. Wong, University of Virginia and Peter Steiner, University of Wisconsin-Madison



A Design-Replication Study with Panel Data and Two Control Groups
Casey Wichman, University of Maryland and Paul Ferraro, Georgia State University


Over the last three decades, a research design has emerged for evaluating the performance of non-experimental designs and design features in field settings: the within-study comparison (WSC), or design replication study. In the traditional WSC design, treatment effects from a randomized experiment are compared with those produced by a non-experimental approach that shares the same target population. The non-experiment may be a quasi-experimental design, such as a regression discontinuity (RD) or interrupted time series design, or an observational study approach such as matching, standard regression adjustment, or difference-in-differences. The goals of the WSC are to determine whether the non-experiment can replicate results from the randomized experiment (which provides the causal benchmark estimate) and to identify the contexts and conditions under which these methods work in practice.

The earliest WSC designs used data from job training evaluations to compare results from a non-experimental study with those from an experimental benchmark that shared the same treatment group. Non-experimental methods were used to match comparison units from extant datasets, such as the Current Population Survey, to the experimental treatment group. The early conclusion from most of these studies was that non-experimental methods failed to produce results comparable to their experimental benchmark estimates. These early WSCs had a profound influence on research practice and priorities: the Office of Management and Budget cited them in its 2004 recommendation that federal agencies use randomized experiments for evaluating program impacts, cautioning against “comparison group studies” that “often lead to erroneous conclusions” (OMB, 2004).

More recent WSCs have revealed promising cases in which non-experimental approaches replicate experimental benchmark results: (1) when treatment and comparison units are assigned to conditions based on an assignment variable and a cutoff, as in the RD design; (2) when intact groups (e.g., schools, villages) are matched “focally” using rich covariate information and “locally” within the same geographic area; and (3) when the selection process is known and observed by the researcher.

The purpose of this panel is to examine within-study comparisons as an approach for evaluating non-experimental methods in field settings and to highlight results from two new WSCs. In the first paper, Vivian Wong and Peter Steiner discuss the multiple purposes of WSC approaches and highlight the unique design, implementation, and analysis issues that WSCs must address to yield credible results. In the second paper, Paul Ferraro and Casey Wichman present results from their WSC, which uses data from a large-scale experiment in an environmental policy context to evaluate the performance of fixed-effects panel data approaches. Finally, Nianbo Dong and Mark Lipsey present results from their WSCs, which examine the performance of propensity score matching methods using experimental and non-experimental data from a pre-kindergarten evaluation. Together, these papers provide researchers with guidance on evaluating non-experimental approaches in field settings, as well as results from two new WSCs that examine the performance of panel data and matching methods in two new policy contexts.
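To make the basic WSC logic concrete, the short Python sketch below simulates the comparison the abstract describes: an experimental benchmark estimate from random assignment set against a non-experimental estimate computed on a self-selected comparison group. The data-generating process, variable names, and the choice of regression adjustment are illustrative assumptions for this sketch only; they are not drawn from the panel's studies.

    # Illustrative sketch only: a toy within-study comparison (WSC) on
    # simulated data. The data-generating process and variable names are
    # hypothetical, not taken from any of the panel's papers.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    # Covariate that drives both the outcome and (in the non-experimental
    # arm) selection into treatment.
    x = rng.normal(size=n)
    tau = 2.0  # true treatment effect used to generate the data

    def outcome(treat, x):
        return 1.0 + tau * treat + 1.5 * x + rng.normal(size=x.shape)

    # Experimental benchmark: random assignment, difference in means.
    t_exp = rng.integers(0, 2, size=n)
    y_exp = outcome(t_exp, x)
    ate_benchmark = y_exp[t_exp == 1].mean() - y_exp[t_exp == 0].mean()

    # Non-experiment: units self-select into treatment as a function of x,
    # so a naive treated-vs-comparison contrast is biased.
    p_treat = 1 / (1 + np.exp(-1.5 * x))
    t_obs = rng.binomial(1, p_treat)
    y_obs = outcome(t_obs, x)
    naive = y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean()

    # Regression adjustment for x (one of the observational approaches the
    # abstract mentions): OLS of y on [1, treatment, x].
    X = np.column_stack([np.ones(n), t_obs, x])
    beta, *_ = np.linalg.lstsq(X, y_obs, rcond=None)
    adjusted = beta[1]

    print(f"Experimental benchmark ATE:      {ate_benchmark:.2f}")
    print(f"Naive non-experimental estimate: {naive:.2f}")
    print(f"Covariate-adjusted estimate:     {adjusted:.2f}")
    # A WSC asks whether the non-experimental estimates reproduce the
    # benchmark; here adjustment recovers it because selection depends only
    # on the observed covariate x, echoing promising case (3) above.

In this toy setup the naive contrast overstates the effect while the covariate-adjusted estimate tracks the experimental benchmark, which is the kind of correspondence (or failure of correspondence) a WSC is designed to document.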