Panel Paper: Covariate Selection for PS Designs In the Absence of Substantive Theories

Saturday, November 10, 2012 : 10:55 AM
Washington (Sheraton Baltimore City Center Hotel)


Peter Steiner, University of Wisconsin-Madison; Thomas Cook, Northwestern University IPR; and Wei Li, Northwestern University


The choice of covariates for removing selection bias from observational studies is crucial. Selection bias can be removed completely only if the selection mechanism is ignorable, that is, if all confounders of treatment selection and potential outcomes are available and reliably measured. Ideally, covariates are selected according to well-grounded substantive theories about the selection process and the outcome-generating model. However, with weak or no theories about these two matters, covariate selection strategies become more heuristic. This paper examines the bias reduction achieved by covariate sets that vary in the number of covariates per domain and in the number of conceptually heterogeneous domains from which the covariates are sampled.
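For reference, the ignorability condition mentioned above is commonly written in the following form. The notation is an assumption introduced here and does not appear in the abstract: Z is the treatment indicator, Y(0) and Y(1) are the potential outcomes, and X is the set of observed covariates.

```latex
\bigl(Y(0),\, Y(1)\bigr) \perp\!\!\!\perp Z \mid X,
\qquad 0 < \Pr(Z = 1 \mid X) < 1
```

The first part states that, given X, treatment selection is independent of the potential outcomes; the second requires that every unit has some chance of receiving either condition. The propensity score is the conditional probability Pr(Z = 1 | X).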

Using the within-study comparison of Shadish, Clark & Steiner (2008) and the ECLS-K dataset, we investigate three research questions. First, how important is it to have a large and heterogeneous set of covariates? Second, how important is it to sample multiple items per domain? Third, since each data set enables us to identify the likely true selection process, how much bias reduction is still achieved when the most effective covariates are left out of the set used to correct for selection bias? In other words, will a large and heterogeneous set of covariates compensate for the absence of the most effective covariates? These questions address what can be learned about bias reduction when theory about selection is meager but the number and dimensionality of covariates vary independently.

The results from both studies, the within-study comparison and the ECLS-K dataset, indicate that bias reduction increases as the number of covariates per domain increases, though at a diminishing rate. Sampling covariates from multiple heterogeneous construct domains also increases bias reduction and matters more than having many measurements of only a few domains. Combining the maximal heterogeneity sampled with at least five items per domain removes almost all the bias in the two educational data sets examined. When the most effective single covariates are deliberately omitted (which no analyst would ever do in practice), bias reduction again increases as a joint function of the heterogeneity of domains and the number of items per domain, but it is open to debate whether the level of bias reduction achieved is acceptable. There is hardly any such doubt when what turn out to be the crucial covariates are included among the covariates sampled.
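The abstract does not say how bias reduction is quantified. A common convention in within-study comparisons is to express it as the percentage of the initial bias that an adjustment removes, relative to a benchmark estimate such as the one from a randomized arm. The sketch below assumes that convention; the function name and all numbers are hypothetical and are not results from the paper.

```python
def percent_bias_reduction(unadjusted: float, adjusted: float, benchmark: float) -> float:
    """Percentage of the initial bias removed by a covariate adjustment.

    Assumed convention (not taken from the abstract):
        100 * (1 - (adjusted - benchmark) / (unadjusted - benchmark))
    Values above 100 mean the adjustment overcorrected past the benchmark.
    """
    initial_bias = unadjusted - benchmark
    if initial_bias == 0:
        raise ValueError("The unadjusted estimate has no bias to reduce.")
    remaining_bias = adjusted - benchmark
    return 100.0 * (1.0 - remaining_bias / initial_bias)


# Hypothetical effect estimates, for illustration only.
benchmark_estimate = 4.0   # e.g. the estimate from a randomized comparison
unadjusted_estimate = 6.0  # observational estimate before any adjustment

# Hypothetical adjusted estimates for two covariate sets.
adjusted_estimates = {
    "one domain, many items": 4.5,
    "several domains, few items each": 4.25,
}

for label, estimate in adjusted_estimates.items():
    pbr = percent_bias_reduction(unadjusted_estimate, estimate, benchmark_estimate)
    print(f"{label}: {pbr:.1f}% of the initial bias removed")
```

Comparing this quantity across covariate sets that differ in the number of domains and in the number of items per domain is one way to tabulate the kind of pattern described above.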