Panel Paper: How Well Do Propensity Score Methods Approximate Experiments Using Pretest and Demographic Information in Educational Research?

Friday, November 7, 2014 : 10:55 AM
Apache (Convention Center)


Nianbo Dong, University of Missouri and Mark W. Lipsey, Vanderbilt University
Cook and colleagues have used within-study comparisons to identify under what conditions (e.g., covariates selection, matching within or between locations/clusters, etc.) quasi-experiments can replicate experiments. Some useful suggestions about constructing a good comparison group have been made, e.g., using local matching and including pretests in matching (Cook, Shadish, & Wong, 2008; Michalopoulos, Bloom, & Hill, 2004; Steiner, Cook, Shadish, & Clark, 2010). In particular, Wong, Hallberg, & Cook (2013) examined the relative importance of focal and local matching and concluded that intact school matching within districts is capable of replicating experimental estimates. Although advances have been made in this area, as Cook (2012) suggested, more within-study comparisons are needed to assess the robustness of the ability of well designed and implemented quasi-experiments to replicate experiments across different populations, settings, and times, etc.

The purpose of this study is to analyze data from four IES-funded projects to assess how well propensity score methods can approximate the results of randomized comparisons when control groups from other studies are substituted for the original randomized controls. The effects found in a randomized experiment of the Building Blocks pre-k math curriculum in Tennessee schools provided benchmarks for internally valid estimates. By replacing the control group in this study with control groups from the other studies, we examined how closely propensity score estimates in those nonrandomized comparisons matched the effect estimates from the original randomized study. The nonrandomized comparisons were constructed in two ways: (1) control samples from different states substituted for the original controls, and (2) control samples from other studies within the same state and districts (local matching) substituted for the original controls. Student demographic information and pretest scores were used to estimate the propensity scores.
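As an illustration of this estimation step, the sketch below fits a logistic regression of treatment status on demographics and a pretest to obtain propensity scores. This is a minimal example, not the authors' actual code: the data frame `df`, the treatment indicator `treated`, and the covariate names are hypothetical placeholders.

```python
# Illustrative sketch: propensity scores from student demographics and a pretest.
# Assumes a pandas DataFrame `df` with a binary `treated` column (1 = treatment
# group, 0 = substituted control group) and hypothetical covariate columns.
import pandas as pd
from sklearn.linear_model import LogisticRegression

covariates = ["pretest", "age", "gender", "free_lunch", "ell_status"]  # hypothetical names
X = pd.get_dummies(df[covariates], drop_first=True)  # dummy-code categorical covariates
y = df["treated"]

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X, y)
df["pscore"] = ps_model.predict_proba(X)[:, 1]  # estimated propensity scores
```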

Three types of propensity score methods were used to estimate the average effect of the treatment on the treated (ATT): (1) one-to-one optimal matching, (2) weighting by the odds of the propensity score, and (3) stratification. The point estimates and their 95% confidence intervals were then compared with the benchmark estimates. Bias was calculated as the difference in point estimates between the propensity score results and the benchmark, taking into account the estimation error represented by the confidence intervals.
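The sketch below illustrates two of the three ATT estimators named above, weighting by the odds and stratification, using the hypothetical `df`, `treated`, `pscore`, and outcome column `posttest` from the previous example. It is a schematic under those assumptions, not the study's implementation.

```python
# Illustrative ATT estimators using the hypothetical DataFrame `df` from above.
import numpy as np
import pandas as pd

treated = df["treated"] == 1
control = ~treated

# (2) Weighting by the odds: treated units get weight 1; control units get
#     weight e(x) / (1 - e(x)), reweighting controls toward the treated group.
df["att_weight"] = np.where(treated, 1.0, df["pscore"] / (1.0 - df["pscore"]))
att_weighting = (
    df.loc[treated, "posttest"].mean()
    - np.average(df.loc[control, "posttest"], weights=df.loc[control, "att_weight"])
)

# (3) Stratification: form propensity score quintiles and average the
#     within-stratum mean differences, weighted by the number of treated units.
df["stratum"] = pd.qcut(df["pscore"], q=5, labels=False)
strata = df.groupby("stratum").apply(
    lambda g: pd.Series({
        "diff": g.loc[g["treated"] == 1, "posttest"].mean()
                - g.loc[g["treated"] == 0, "posttest"].mean(),
        "n_treated": (g["treated"] == 1).sum(),
    })
)
att_stratification = np.average(strata["diff"], weights=strata["n_treated"])

# Bias, in the sense described above, would then be each ATT estimate minus
# the benchmark ATT from the randomized experiment.
```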

For comparison groups from other states, the propensity score estimates agreed with the benchmark estimates with regard to statistical significance but nonetheless showed bias ranging from modest to substantial in magnitude. For comparison groups from the same state and districts, most of the propensity score estimates also led to the same statistical significance conclusion as the benchmark estimates. Here, too, however, many comparisons showed sizable bias in the point estimates. The propensity score methods in this exploration, therefore, did not ensure unbiased estimates, though they generally produced the same conclusions about whether the treatment effect was statistically significant. In sum, propensity score methods may be useful when randomization is not feasible, but "how close is close enough" (Wilde & Hollister, 2007) remains an open question.