Panel Paper: Validity and Precision of the Difference-In-Difference and Comparative Interrupted Time Series Designs In Educational Evaluation

Saturday, November 10, 2012 : 10:55 AM
International A (Sheraton Baltimore City Center Hotel)

*Names in bold indicate Presenter

Marie-Andree Somers1, Pei Zhu1, Robin Tepper Jacob2 and Howard Bloom1, (1)MDRC, (2)University of Michigan

Because randomized experiments are not always feasible, impact evaluations must often rely on a quasi-experimental design (QED) instead. In this paper, we examine the validity and precision of two promising QEDs for educational evaluation: the difference-in-difference (DD) design and the comparative interrupted time series (CITS) design. The DD design evaluates the impact of a program by testing whether the treatment group deviates from its baseline mean by a greater amount than the comparison group does. In contrast, with a CITS design, program impacts are evaluated based on whether the treatment group deviates from its baseline trend by a greater amount than the comparison group. The CITS design has more stringent data requirements than the DD design: scores must be available for at least four time points before the intervention begins in order to estimate the baseline trend. However, the CITS design is more rigorous in theory, because it implicitly controls for differences between the treatment and comparison groups in both the baseline mean and the baseline trend.

This paper examines the properties of these two designs in the context of the federal Reading First program, as implemented in a large Midwestern state. This example was chosen for two reasons. Most importantly, the true impact of Reading First in this state is known, because program effects can be evaluated using a regression discontinuity (RD) design. The RD design is as rigorous as a randomized experiment under certain conditions; the RD impact estimate therefore provides a strong “benchmark” against which to compare the findings obtained from the DD and CITS designs. Second, several years of reading test scores are publicly available from the state, which makes it possible to use a CITS design to evaluate Reading First. Using these data, we explore several questions.
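The contrast between the two estimators can be sketched in a few lines. The following is an illustrative simplification with made-up numbers, not the paper's actual estimation model (which would be fit with regression on school-level panel data): the DD estimate nets out deviations from the baseline *mean*, while the CITS estimate nets out deviations from the baseline *trend*, here approximated by an OLS line through at least four pre-intervention points.

```python
# Illustrative sketch (hypothetical data, not the authors' model):
# compare DD and CITS impact estimates from school-group mean scores.
# Baseline years are indexed ..., -2, -1; year 0 is the first
# post-intervention year.

def dd_impact(treat_pre, treat_post, comp_pre, comp_post):
    """DD: deviation from the baseline MEAN, treatment minus comparison."""
    treat_dev = treat_post - sum(treat_pre) / len(treat_pre)
    comp_dev = comp_post - sum(comp_pre) / len(comp_pre)
    return treat_dev - comp_dev

def cits_impact(treat_pre, treat_post, comp_pre, comp_post):
    """CITS: deviation from the baseline TREND (OLS line through the
    pre-period points, extrapolated to year 0)."""
    def trend_forecast(pre):
        n = len(pre)                      # CITS needs >= 4 baseline points
        xs = list(range(-n, 0))           # baseline years ..., -2, -1
        mx = sum(xs) / n
        my = sum(pre) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, pre))
                 / sum((x - mx) ** 2 for x in xs))
        return my + slope * (0 - mx)      # predicted score in year 0
    return ((treat_post - trend_forecast(treat_pre))
            - (comp_post - trend_forecast(comp_pre)))

# Treatment schools improve 1 point/year at baseline; comparison is flat.
treat_pre = [50.0, 51.0, 52.0, 53.0]
comp_pre = [50.0, 50.0, 50.0, 50.0]
dd = dd_impact(treat_pre, 56.0, comp_pre, 50.0)      # → 4.5
cits = cits_impact(treat_pre, 56.0, comp_pre, 50.0)  # → 2.0
```

With a differential baseline trend, as in this toy example, the two designs diverge: DD attributes the treatment group's pre-existing growth to the program (4.5), while CITS extrapolates the trend and nets it out (2.0), which is why CITS is the more rigorous design in theory.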
First, we examine whether a well-executed CITS and/or DD design can produce valid inferences about the effectiveness of a school-level intervention such as Reading First. Second, we explore the trade-off between bias reduction and precision across different methods of selecting comparison groups for the CITS/DD designs (e.g., one-to-one vs. one-to-many matching, matching with vs. without replacement). Third, we examine whether matching on baseline demographic characteristics (in addition to baseline scores) further improves the validity of the impact estimates. Finally, we examine how the CITS design performs relative to the DD design with respect to bias and precision.

Overall, we find that both the CITS and DD designs provide valid inferences about the effectiveness of Reading First: the magnitude of the bias is at most 0.02-0.03 in effect size units. Although the DD design performs as well as the CITS design, this result may be specific to the particular circumstances of Reading First. We also conclude that all comparison group selection methods (e.g., one-to-one vs. one-to-many matching) provide correct inferences about impacts, but that estimates from some methods are more precise because of their larger sample sizes. Finally, we find that matching on demographic characteristics (in addition to test scores) does not further reduce bias.
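The comparison-group selection methods weighed above can be sketched as variations on nearest-neighbor matching. The data and matching rule below are hypothetical (the paper does not specify its exact procedure); the sketch only shows how the one-to-one/one-to-many and with/without-replacement choices change the matched sample, and hence its size and precision.

```python
# Hedged sketch (made-up schools and scores): nearest-neighbor matching
# of comparison schools to treatment schools on a baseline test score.

def match_schools(treat, pool, k=1, replacement=True):
    """Return {treatment school: k nearest comparison schools},
    matched on absolute distance in baseline score."""
    available = dict(pool)  # comparison school -> baseline score
    matches = {}
    for school, score in treat.items():
        ranked = sorted(available, key=lambda c: abs(available[c] - score))
        chosen = ranked[:k]
        matches[school] = chosen
        if not replacement:
            for c in chosen:
                del available[c]  # each comparison school used at most once
    return matches

treat = {"T1": 50.0, "T2": 51.0}
pool = {"C1": 50.2, "C2": 50.9, "C3": 49.0, "C4": 52.5}
one_to_one = match_schools(treat, pool, k=1, replacement=False)
one_to_many = match_schools(treat, pool, k=2, replacement=True)
```

One-to-many matching (and matching with replacement) retains more comparison schools per treatment school, which is the larger-sample-size channel through which some selection methods yield more precise estimates.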