Panel Paper: Quantifying the Policy Reliability of Competing Non-Experimental Methods for Measuring the Impacts of Social Programs

Thursday, November 2, 2017
McCormick (Hyatt Regency Chicago)


Stephen Bell1, Hiren Nisar1, Claudia D. Solari1 and Larry Orr2, (1)Abt Associates, Inc., (2)Johns Hopkins University

Determining with confidence the effectiveness of public social programs requires randomized controlled trial (RCT) evidence free from selection bias and omitted confounders, or else impact analysis methods based on non-experimental data that are equally reliable. A large literature seeks to identify such methods by comparing quasi-experimental design (QED) impact estimates against experimental benchmarks, a literature known variously as “design replication studies” and “within-study comparison designs.” But how close is close enough when looking for reliable policy guidance?

Two decades ago, Bell and Orr (1995) introduced the only known method for formally quantifying the policy reliability of QED findings against an experimental benchmark. Their method, grounded in Bayesian statistical theory, computes a “maximum risk function”: the probability of an incorrect policy decision for different magnitudes of true impact that policymakers consider sufficient to justify continued or expanded funding of the studied intervention. Bell et al. (1995) applied the method to three QED approaches used to measure the impacts of job training interventions in the presence of selection bias. No other applications are known.
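The abstract does not reproduce the Bell-Orr formulas, but the idea of a risk function for policy decisions can be illustrated with a minimal sketch. The sketch below assumes (hypothetically; these modeling choices are not from the source) that the QED estimator is normally distributed around the true impact plus a bias term, and that policymakers fund the program whenever the estimate exceeds a threshold they consider sufficient to justify funding. The decision-error probability, and its maximum over candidate true impacts, can then be computed directly.

```python
# Illustrative sketch only -- NOT the actual Bell-Orr (1995) formulas,
# which are not given in this abstract. All modeling assumptions
# (normality, a simple threshold funding rule) are hypothetical.
from statistics import NormalDist


def decision_error_prob(true_impact, bias, se, threshold):
    """Probability that the QED estimate leads to the wrong funding decision.

    Assumes the QED estimate ~ Normal(true_impact + bias, se) and the
    program is funded iff the estimate exceeds `threshold`. A decision is
    wrong when it disagrees with the rule applied to the true impact.
    """
    p_fund = 1 - NormalDist(true_impact + bias, se).cdf(threshold)
    should_fund = true_impact >= threshold
    return (1 - p_fund) if should_fund else p_fund


def max_risk(bias, se, threshold, candidate_impacts):
    """Maximum decision-error probability over candidate true impacts."""
    return max(
        decision_error_prob(t, bias, se, threshold)
        for t in candidate_impacts
    )
```

Under these assumptions, an unbiased estimator's risk peaks at 0.5 when the true impact sits exactly at the funding threshold, and any selection bias raises the error probability for true impacts on the opposite side of the threshold from the bias.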

The current paper applies the Bell-Orr methodology to the large body of RCT replication efforts using QED methods that exists today. It reassesses conclusions about QED methods that the original authors judged to provide an adequate substitute for an experiment. For any given paper, the adequate-substitute estimate with the smallest standard error is scrutinized as the single most informative case of claimed success. So too is the QED estimate with the smallest standard error among those the authors judged not to provide an adequate substitute, thereby ensuring balanced examination of both favorable and unfavorable conclusions from contributors to the methods-reliability literature.

Applying the Bell-Orr criterion of policy reliability to the accumulated design replication/within-study comparison results indicates how much trust the profession should place in the literature's claims about relying on non-experimental methods to measure the impacts of social programs. Quite different conclusions emerge than one would reach by simply taking that literature at face value.