Panel Paper: Asymdystopia: The Threat of Small Biases in Evaluations of Education Interventions That Need to be Powered to Detect Small Impacts

Saturday, November 10, 2018
Wilson B - Mezz Level (Marriott Wardman Park)


John Deke1, Thomas Wei2 and Tim Kautz1, (1)Mathematica Policy Research, (2)U.S. Department of Education


Evaluators of policy interventions increasingly need to design studies to detect impacts much smaller than the 0.20 standard deviations that Cohen (1988) characterized as “small.” However, the drive to detect smaller impacts may create a new challenge for researchers: the need to guard against correspondingly smaller biases. When studies were designed to detect impacts of 0.20 standard deviations or larger, it may have been reasonable for researchers to regard small biases, such as 0.03 standard deviations, as ignorable. But in a study powered to detect much smaller impacts, such as 0.03 standard deviations (e.g., Chiang et al. 2015), a bias of 0.03 standard deviations is as large as the impact itself.

This paper examines the potential for small biases to increase the risk of making false inferences as studies are powered to detect smaller impacts, a phenomenon we refer to as asymdystopia. We examine this potential for two of the most rigorous designs commonly used in evaluation research.
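To make the phenomenon concrete, the sketch below (illustrative only, not a result from the paper) assumes a normally distributed impact estimator, a two-sided test at the 5 percent level, and 80 percent power at each target impact, and computes the chance of a false positive when the estimator carries a fixed bias of 0.03 standard deviations. All parameter values are assumptions chosen to mirror the numbers in the motivation above.

# Illustrative only: false-positive rate of a two-sided z-test when the
# estimator carries a fixed bias, as studies are powered for smaller impacts.
from scipy.stats import norm

alpha, power = 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # about 1.96
z_power = norm.ppf(power)           # about 0.84
bias = 0.03                         # assumed bias, in standard deviation units

for mde in [0.20, 0.10, 0.05, 0.03]:
    se = mde / (z_alpha + z_power)  # standard error implied by 80% power at this MDE
    shift = bias / se               # bias expressed in standard-error units
    # Probability of rejecting a true null when the estimate is centered at `bias`
    type1 = norm.cdf(-z_alpha - shift) + 1 - norm.cdf(z_alpha - shift)
    print(f"MDE={mde:.2f}  SE={se:.3f}  Type I error approx. {type1:.2f}")

Under these stylized assumptions, the same 0.03 standard deviation bias yields a false-positive rate of roughly 7 percent in a study powered for 0.20 standard deviations, but a false-positive rate near 80 percent in a study powered for 0.03 standard deviations.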

First, we consider the role of attrition bias in randomized controlled trials (RCTs) as studies are powered to detect smaller impacts. To do so, we apply an attrition model for RCTs that is used in several federal evidence reviews, including the What Works Clearinghouse (WWC 2013; 2014). Using this model and WWC data on attrition from more than 800 prior studies, we show how attrition may become less acceptable, leading to higher rates of false inferences, as studies are powered to detect smaller effects. We also use the WWC data to assess the feasibility of achieving lower attrition rates in studies powered to detect small impacts.
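As a complement to the attrition model, the Monte Carlo sketch below (again illustrative only, not the WWC model and not the paper's analysis) shows how outcome-related, differential attrition of a plausible size translates into bias on the order of a few hundredths of a standard deviation. The attrition rates and the strength of the attrition-outcome relationship are assumptions.

# Stylized Monte Carlo illustration (not the WWC attrition model): bias in an
# RCT with zero true impact when attrition is related to the outcome and is
# higher in the control group. All parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def mean_bias(n=2000, reps=2000):
    estimates = []
    for _ in range(reps):
        treat = rng.integers(0, 2, size=n).astype(bool)
        y = rng.normal(0.0, 1.0, size=n)   # outcome in SD units; true impact is zero
        # Treatment group: 10% attrition, unrelated to the outcome.
        # Control group: 20% attrition, concentrated among low-outcome units.
        attrit_prob = np.where(treat, 0.10, np.clip(0.20 - 0.025 * y, 0.0, 1.0))
        respond = rng.random(n) > attrit_prob
        est = y[treat & respond].mean() - y[~treat & respond].mean()
        estimates.append(est)
    return float(np.mean(estimates))

print(f"average bias: {mean_bias():.3f} SD")

Under these assumed parameters, the bias comes out at roughly 0.03 standard deviations in absolute value: negligible against a 0.20 standard deviation target impact, but rivaling the impact itself in a study powered to detect 0.03 standard deviations.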

Second, we examine functional form misspecification bias in Regression Discontinuity Designs (RDDs) as studies are powered to detect smaller impacts. To do so, we use Monte Carlo simulations to assess what happens as the RDD sample size increases under varying assumptions about the true functional form. The data-generating processes for these simulations are based on data from several prior large-scale RCTs in education (James-Burdumy et al. 2010; Constantine et al. 2009; Campuzano et al. 2009). Specifically, we examine the effect of a larger sample size on statistical power, functional form misspecification bias, and the accuracy of estimated p-values. Our simulations show that, with conventional estimation, Type I error rates rise as studies are powered to detect smaller impacts, but that the robust estimation approach recommended by Calonico et al. (2014) solves this problem.
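The sketch below (a stylized example, not the paper's data-generating processes, which are based on the prior RCTs cited above) illustrates the conventional-estimation side of this result: the true discontinuity is zero and the outcome is mildly nonlinear in the running variable, so every rejection from a misspecified global linear RDD is a false positive, and the rejection rate rises with the sample size. The functional form, noise level, and sample sizes are assumptions.

# Stylized illustration (not the paper's simulations): Type I error of a
# misspecified global-linear RDD as the sample size grows. The true
# relationship between outcome and running variable is mildly nonlinear and
# the true discontinuity at the cutoff is zero.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def rejection_rate(n, reps=1000, alpha=0.05):
    rejections = 0
    for _ in range(reps):
        x = rng.uniform(-1, 1, size=n)        # running variable, cutoff at 0
        d = (x >= 0).astype(float)            # treatment indicator
        y = 0.3 * x + 0.2 * x**3 + rng.normal(0, 1, size=n)   # zero true effect
        # Conventional (misspecified) estimation: global linear trend on each side.
        X = sm.add_constant(np.column_stack([d, x, d * x]))
        fit = sm.OLS(y, X).fit()
        if fit.pvalues[1] < alpha:            # p-value on the treatment indicator
            rejections += 1
    return rejections / reps

for n in [500, 2000, 8000]:
    print(n, rejection_rate(n))

In this sketch the rejection rate climbs well above the nominal 5 percent as n grows, because the fixed misspecification bias becomes large relative to the shrinking standard error. As noted above, the robust estimation approach recommended by Calonico et al. (2014), implemented for example in the rdrobust software, addresses this problem.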

Overall, our findings suggest that biases that might have once been reasonably ignorable can pose a real threat in evaluations that are powered to detect small impacts. This paper identifies and quantifies some of these biases, and shows that they are important to consider when designing evaluations and when analyzing and interpreting evaluation findings. We also discuss potential strategies to address these biases. Our findings should not be interpreted as suggesting that researchers should avoid powering evaluations to detect small impacts. The problem of small biases is real but surmountable—so long as it is not ignored.