Estimating Labor Market Returns to an Early Childhood Intervention: A Comparison of Survey, Administrative, Matched, and Imputed Data

Arteaga, Irma

A growing literature compares administrative, survey, and matched earnings data, which tend to have substantial amounts of missing values (Britton, Shephard, & Vignoles, 2015; Dahl, DeLeire, & Schwabish, 2011; Hotz & Scholz, 2002). How scholars handle missing earnings can affect model estimates and standard errors (Kleinke et al., 2011; Mohadjer & Choudhry, 2002; Penn, 2007); however, the literature has not reached agreement on the best way to correct for missing earnings data. While some prefer to drop missing cases (Dahl, DeLeire, & Schwabish, 2011), others prefer to impute them. Imputation methods range from regression analysis (Gertler et al., 2014) to multiple imputation (Britton, Shephard, & Vignoles, 2015; Chen & Fu, 2015; Dragoset & Fields, 2008; Penn, 2007) and simple imputation (Ryder et al., 2011). To our knowledge, little research compares different data sources and modes of imputing earnings specifically for low-income individuals. Using data from the Chicago Longitudinal Study (CLS), our study adds to the literature by examining earnings data from survey and administrative sources, by combining both sources, and by imputing earnings for missing observations. The sample is a single cohort of about 1,500 low-income individuals who participated in a quasi-experiment in Chicago in the mid-1980s that provided an enriched preschool program.
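For readers less familiar with multiple imputation, the sketch below shows how estimates from several imputed datasets are commonly pooled using Rubin's rules. It is an illustration only, not the procedure used in this study: the function name `pool_rubin` and all numeric values are hypothetical placeholders, not CLS results.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # sample variance across imputations (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Hypothetical program-effect estimates from five imputed datasets.
estimates = [0.12, 0.14, 0.13, 0.15, 0.11]
std_errors = [0.05, 0.05, 0.06, 0.05, 0.05]
pooled, pooled_se = pool_rubin(estimates, [se ** 2 for se in std_errors])
print(f"pooled effect = {pooled:.3f}, pooled SE = {pooled_se:.3f}")
```

The between-imputation term is what separates multiple from single imputation: a single imputed dataset treats the filled-in earnings as if they were observed, so its standard errors omit the uncertainty due to missingness.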

The CLS collected self-reported income information at ages 33-36 and also acquired administrative earnings records for the same period. Preliminary findings indicate that self-reported income is 20 percent higher than income from administrative records. Our adjusted models show that earnings of the preschool group are 13 percent higher than earnings of the comparison group. Results for non-imputed values were very similar to those using multiple imputation (13%-16%) and lower than those using regression analysis or single imputation (20%-21%). All of these results used a Heckman correction for censoring. We found that the censoring correction is important because we observe earnings only for those in the labor force who are employed. When we do not use the censoring correction, the estimated effects of the program are 10 percentage points higher than when we do. We also tested the sensitivity of our results to selection, to the sample used (administrative, survey, or combined), and to the predictors used for imputation. We also pay close attention to the earnings of the formerly incarcerated in light of the new results in Schanzenbach et al. (2016).
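As a reminder of what a Heckman correction for censoring involves, the textbook two-step form can be written as below. This is a generic sketch, not the specification estimated in this study: the symbols ($E_i$ for employment, $Z_i$ and $X_i$ for covariates, $\ln y_i$ for log earnings) and the choice of log earnings as the outcome are assumptions made for illustration.

```latex
\begin{align}
\Pr(E_i = 1 \mid Z_i) &= \Phi(Z_i \gamma) \\
\mathbb{E}[\ln y_i \mid X_i, E_i = 1] &= X_i \beta + \rho \sigma_u \, \lambda(Z_i \gamma),
\qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{align}
```

The first equation is a probit for being observed with earnings. The second adds the inverse Mills ratio $\lambda$ to the earnings equation; leaving that term out is what allows selection into employment to bias the estimated program effect.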

This study aims to promote discussion among policy researchers on the challenges of estimating earnings when data are missing and when only one source of data is available (self-report or administrative records). It also provides practical guidance on basic questions such as “when to impute missing values for income?”, “what type of information do we need to have?”, “what type of imputation technique is more appropriate to use?”, and “how sensitive are our results to censoring correction?” We propose strategies toward a more rigorous evaluation of earnings with missing values.

Association for Public Policy Analysis & Management

Panel Paper: Estimating Labor Market Returns to an Early Childhood Intervention: A Comparison of Survey, Administrative, Matched, and Imputed Data