Panel: The Generalizability of Impact Evaluation Findings: New Empirical Evidence
(Tools of Analysis: Methods, Data, Informatics and Research Design)

Thursday, November 2, 2017: 1:45 PM-3:15 PM
Field (Hyatt Regency Chicago)

*Names in bold indicate Presenter

Panel Organizer:  Robert Olsen, Rob Olsen LLC
Panel Chair:  Kelly Hallberg, University of Chicago
Discussant:  John Deke, Mathematica Policy Research

How Much Can External Validity Bias Be Reduced by Aligning Sample and Population on School District Characteristics?
Stephen Bell1, Robert Olsen2, Larry Orr3, Elizabeth Stuart3 and Michelle Wood1, (1)Abt Associates, Inc., (2)Rob Olsen LLC, (3)Johns Hopkins University

Using Rigorous Evaluation Results to Improve Local Policy Decisions
Larry Orr1, Stephen Bell2, Robert Olsen3, Elizabeth Stuart1, Azim Shivji2 and Ian Schmidt1, (1)Johns Hopkins University, (2)Abt Associates, Inc., (3)Rob Olsen LLC

While impact evaluations provide evidence on causal effects for the study sample, a key question is whether that evidence generalizes beyond the study sample, allowing policymakers to predict the consequences of policy decisions such as whether to adopt or scale up an intervention or whether to cancel a program. To date, there has been some research on the theory of the problem (e.g., Olsen et al., 2013; Tipton, 2013) and on improved designs and analysis methods to address it (e.g., Cole and Stuart, 2010; Kern et al., 2016; Olsen and Orr, 2016; Tipton, 2014). This panel will add to a small but growing body of evidence (e.g., Allcott, 2015; Bell et al., 2015; Stuart et al., 2017) on the generalizability of impact evaluation findings beyond their samples to the populations of interest for policy.

This session will offer additional evidence on the extent to which findings from impact evaluations (in particular, impact evaluations of educational interventions) generalize to other populations. The first paper, by Larry Orr and colleagues, presents empirical evidence on the generalizability of results from several multi-site impact evaluations in education to individual sites. The second paper, by Sean Tanner, presents empirical evidence on the generalizability of the samples selected for impact studies that have been reviewed by the What Works Clearinghouse; it also provides evidence on whether findings from early studies reviewed by the WWC were replicated by later studies of the same interventions. The third paper, by Stephen Bell and colleagues, provides evidence on the extent to which the external validity bias from purposive site selection can be reduced using publicly available data on schools and districts (e.g., from the Common Core of Data) and simple statistical methods (e.g., linear regression models, propensity score matching).


Allcott, H. (2015). Site selection bias in program evaluation. The Quarterly Journal of Economics, 130(3), 1117-1165.

Cole, S. R., & Stuart, E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: The ACTG-320 trial. American Journal of Epidemiology, 172, 107-115.

Kern, H.L., Stuart, E.A., Hill, J., and Green, D.P. (2016).  Assessing methods for generalizing experimental impact estimates to target populations.  Journal of Research on Educational Effectiveness, 9(1), 103-127. 

Olsen, R. B., & Orr, L. L. (2016). On the “where” of social experiments: Selecting more representative samples to inform policy. New Directions for Evaluation (special issue on Social Experiments in Practice).

Olsen, R. B., Orr, L. L., Bell, S. H., & Stuart, E. A. (2013). External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management, 32(1), 107-121.

Stuart, E. A., Bell, S. H., Ebnesajjad, C., Olsen, R. B., & Orr, L. L. (2017). Characteristics of school districts that participate in rigorous national educational evaluations. Journal of Research on Educational Effectiveness, 10(1), 168-206.

Tipton, E. (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38(3), 239-266.

Tipton, E. (2014). Stratified sampling using cluster analysis: A sample selection strategy for improved generalizations from experiments. Evaluation Review, 37(2), 109-139.