
Panel Paper: Assessing the Validity of Comparative Interrupted Time Series Designs in Practice: Lessons from Two Within-Study Comparisons

Thursday, November 12, 2015 : 8:50 AM
Orchid B (Hyatt Regency Miami)


Kelly Hallberg (University of Chicago), Ryan T. Williams and Andrew P. Swanlund (American Institutes for Research)
Applied researchers are using comparative interrupted time series (CITS) designs with increasing frequency to examine the effects of programs and policies. The design has been used to study policy issues such as the effects of school turnaround in Chicago (de la Torre et al., 2012) and of the No Child Left Behind Act nationally (Dee, Jacob, & Schwartz, 2013), and it is increasingly being adopted to study the effects of programs implemented under evidence-based policy initiatives, such as the U.S. Department of Education's Investing in Innovation (i3) program (U.S. Department of Education, 2013). In a simple interrupted time series (ITS) design, researchers compare the pre-treatment values of a treatment group's time series to its post-treatment values in order to assess the impact of a treatment, without any comparison group to account for confounding factors. The CITS design extends the ITS design by evaluating both a treatment group and a comparison group before and after the onset of treatment.
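To make the contrast concrete, a simple level-change parameterization of a CITS regression (a sketch of one common specification, not necessarily the one used in the studies described below) is

Y_{it} = \beta_0 + \beta_1\,\mathrm{Time}_t + \beta_2\,\mathrm{Treat}_i + \beta_3\,(\mathrm{Time}_t \times \mathrm{Treat}_i) + \beta_4\,\mathrm{Post}_t + \beta_5\,(\mathrm{Post}_t \times \mathrm{Treat}_i) + \varepsilon_{it},

where Treat_i indicates membership in the treatment group and Post_t indicates periods after the onset of treatment. Here \beta_4 captures the level shift observed in the comparison group at onset, and \beta_5, the additional shift in the treatment group, is the treatment effect estimate. A simple ITS corresponds to fitting only the treatment-group series, with no comparison series to net out the shift captured by \beta_4.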

CITS design usage may be on the rise, but little is known about the conditions under which the design supports causal inference in practice. A modest but growing body of work examines the validity of ITS analyses through within-study comparisons (WSCs), which compare their estimates to those from experimental designs. Results from several recent WSCs provide some reason for optimism about the performance of CITS: these studies have shown that CITS can produce results that are very similar to those from an RCT (Schneeweiss, Maclure, Carleton, Glynn, & Avorn, 2004; Fretheim, Soumerai, Zhang, Oxman, & Ross-Degnan, 2013; Somers, Zhu, Jacob, & Bloom, 2013; St. Clair, Cook, & Hallberg, 2014; St. Clair, Hallberg, & Cook, under review). However, in some cases this correspondence depends on the modeling choices made by the researcher as well as the stability of the pre-treatment trend (St. Clair, Cook, & Hallberg, 2014; St. Clair, Hallberg, & Cook, under review).

Applied researchers face two primary analytic decisions when implementing a CITS design: (1) how to model the pre-treatment trend and (2) how to select a comparison group. This study draws on data from two empirical within-study comparisons to examine the implications of these decisions. The first WSC draws on data from an RCT studying the effects of an online mathematics program; the second draws on data from an RCT examining the effects of a whole-school reform model. Using these datasets, we estimate how well each approach reproduces the RCT results and calculate the degree of bias remaining under three modeling approaches: (1) a baseline mean model; (2) a baseline slopes model; and (3) a year fixed effects model. In addition, we examine the performance of these three methods when paired with comparison groups identified in four ways: (1) using all available non-treatment cases; (2) matching on pre-treatment measures of the outcome; (3) identifying geographically local matches; and (4) a hybrid approach that combines matching on pre-treatment measures of the outcome with local matching.
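As an illustration of how these three modeling approaches differ in practice, the sketch below fits each specification with statsmodels on a hypothetical school-by-year panel. The column names (school_id, year, score, treated, post, rel_year) and the file panel.csv are assumptions made for the example, not the study's actual data or code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format panel: one row per school-year.
# Assumed columns: school_id, year, score (outcome), treated (0/1),
# post (0/1 for years after treatment onset), rel_year (year - onset year).
df = pd.read_csv("panel.csv")

cluster = {"cov_type": "cluster", "cov_kwds": {"groups": df["school_id"]}}

# (1) Baseline mean: a difference-in-differences contrast against the
#     pre-treatment group means; treated:post is the effect estimate.
m1 = smf.ols("score ~ treated * post", data=df).fit(**cluster)

# (2) Baseline slopes: separate linear pre-treatment trends by group;
#     treated:post is the post-onset deviation of the treatment group from
#     its projected trend, net of the comparison group's deviation.
m2 = smf.ols("score ~ treated * (rel_year + post)", data=df).fit(**cluster)

# (3) Year fixed effects: comparison-group year effects absorb common
#     shocks; treated:post is again the effect estimate.
m3 = smf.ols("score ~ treated + C(year) + treated:post", data=df).fit(**cluster)

for label, m in [("baseline mean", m1),
                 ("baseline slopes", m2),
                 ("year fixed effects", m3)]:
    print(f"{label}: effect = {m.params['treated:post']:.3f} "
          f"(SE = {m.bse['treated:post']:.3f})")
```

The comparison-group selection step (all non-treatment cases, matching on pre-treatment outcomes, local matching, or the hybrid) would be applied upstream, by restricting the panel to the chosen comparison units before estimation.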