What's in a Name? Multi-Source String Distance Record Linkage with Machine Learning Optimization to Produce a Comprehensive Longitudinal State P20W Dataset

Seith, David C.

This paper describes the use of several administrative data sources, string comparison metrics, and decision rules to create a comprehensive preschool to postsecondary longitudinal database of annual education, employment, and earnings outcomes.

Since 2002, nearly every state has participated in the National Center for Education Statistics’ Statewide Longitudinal Data Systems (SLDS) Grant Program, building a longitudinal data system of annual educational outcomes that follows each public school student from preschool through high school (PreK-12).

With additional resources from the Department of Labor’s Workforce Data Quality Initiative (WDQI), many states have integrated longitudinal employment data into P20W data systems, following student employment and earnings for up to 10 years beyond high school.

States that take this course face two sets of challenges.

The public administration challenge is to identify and secure access to the appropriate state administrative records datasets. We utilized four datasets: Education, Motor Vehicle, Employment Services, and Unemployment Insurance Wage data.

There are three data science challenges.

First, analysts need to organize a sequence of deterministic and probabilistic matches. In the course of our research, we identified a ten-step sequence, progressing from “platinum,” deterministic matches, to multi-source “gold” triangulation matches and probabilistic “silver” fuzzy matches.
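
The tiered sequence described above can be sketched as a cascade that tries the strictest rule first and falls through to looser ones. This is a minimal illustration that collapses the ten steps into three tiers, not the authors' implementation. The field names (`name`, `dob`, `person_key`), the reading of "triangulation" as a link confirmed through a second source, the 0.85 cut-off, and the use of `difflib.SequenceMatcher` as a stand-in fuzzy scorer are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def unique(hits):
    """Accept a tier's result only when exactly one candidate qualifies."""
    return hits[0] if len(hits) == 1 else None

def platinum(student, targets, bridge):
    # Deterministic: exact agreement on name and date of birth (assumed keys).
    key = (student["name"], student["dob"])
    return unique([t for t in targets if (t["name"], t["dob"]) == key])

def gold(student, targets, bridge):
    # Triangulation (assumed meaning): find the student exactly in a second
    # source, then follow that source's identifier into the target dataset.
    key = (student["name"], student["dob"])
    b = unique([r for r in bridge if (r["name"], r["dob"]) == key])
    if b is None:
        return None
    return unique([t for t in targets if t["person_key"] == b["person_key"]])

def silver(student, targets, bridge, cutoff=0.85):
    # Probabilistic: same date of birth, name similarity at or above cutoff.
    # SequenceMatcher is a stand-in scorer, not the metric used in the paper.
    hits = [
        t for t in targets
        if t["dob"] == student["dob"]
        and SequenceMatcher(None, student["name"], t["name"]).ratio() >= cutoff
    ]
    return unique(hits)

TIERS = [("platinum", platinum), ("gold", gold), ("silver", silver)]

def link(student, targets, bridge):
    """Return (tier_label, matched_record), or (None, None) if no tier matches."""
    for label, matcher in TIERS:
        hit = matcher(student, targets, bridge)
        if hit is not None:
            return label, hit
    return None, None

# Hypothetical records for illustration only.
targets = [
    {"name": "MARIA LOPEZ", "dob": "2001-04-02", "person_key": "W1"},
    {"name": "JON SMYTHE", "dob": "2000-01-15", "person_key": "W2"},
    {"name": "ANA RUIZ-DIAZ", "dob": "1999-12-30", "person_key": "W3"},
]
bridge = [{"name": "ANA RUIZ", "dob": "1999-12-30", "person_key": "W3"}]

for s in (
    {"name": "MARIA LOPEZ", "dob": "2001-04-02"},
    {"name": "ANA RUIZ", "dob": "1999-12-30"},
    {"name": "JON SMYTH", "dob": "2000-01-15"},
):
    label, hit = link(s, targets, bridge)
    print(s["name"], "->", label, hit["person_key"] if hit else None)
```

Recording the tier label alongside each link makes it possible to report, as the abstract does, what share of links are deterministic and what share rest on fuzzy matching.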

Second, analysts need to choose effective string distance metrics to evaluate the validity of fuzzy matches. We chose the Jaro-Winkler and Jaccard metrics.
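
To make the two metrics concrete, the following is a small pure-Python sketch of their standard textbook definitions. The abstract does not say how the strings were tokenized or which parameters were used, so the choice of character bigrams for Jaccard and the conventional Winkler settings (prefix weight 0.1, prefix length capped at 4) are assumptions.

```python
def jaro(s1, s2):
    """Jaro similarity: shared characters within a sliding window, less transpositions."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    used1, used2 = [False] * n1, [False] * n2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in each string.
    k = out_of_order = 0
    for i in range(n1):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def jaccard(s1, s2, n=2):
    """Jaccard similarity on sets of character n-grams (bigrams by default)."""
    grams1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    grams2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    union = grams1 | grams2
    if not union:  # both strings shorter than n characters
        return 0.0
    return len(grams1 & grams2) / len(union)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(round(jaccard("SMITH", "SMYTH"), 4))         # 0.3333
```

The two metrics respond to different kinds of error: Jaro-Winkler rewards a shared prefix and tolerates transposed letters, while bigram Jaccard ignores position and penalizes any changed character pair, which is one reason they are often used together.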

Third, in the absence of a “gold standard” authoritative dataset, analysts can select a cut-off matching score based on a specified tolerance for a false positive rate and clerical review.
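
One way to picture the cut-off choice is as a search over a clerically reviewed sample of candidate pairs. The sketch below assumes each reviewed pair carries a matching score and a reviewer's verdict, and it interprets the false positive rate as the share of accepted pairs that reviewers judged to be non-matches. The function name, the data, and the tolerance values are hypothetical, and this is not the optimization procedure used in the paper.

```python
def choose_cutoff(reviewed, tolerance):
    """Return the lowest score cut-off whose accepted pairs stay within tolerance.

    reviewed:  list of (score, is_match) pairs from clerical review.
    tolerance: maximum acceptable share of false positives among accepted pairs.
    Returns None when no cut-off satisfies the tolerance.
    """
    best = None
    # Walk candidate cut-offs from strictest to loosest.
    for cutoff in sorted({score for score, _ in reviewed}, reverse=True):
        accepted = [is_match for score, is_match in reviewed if score >= cutoff]
        fp_share = accepted.count(False) / len(accepted)
        if fp_share <= tolerance:
            best = cutoff  # keep the loosest cut-off seen so far that qualifies
    return best

# Hypothetical clerical-review sample: (matching score, reviewer says match?)
reviewed = [
    (0.99, True), (0.97, True), (0.95, True), (0.93, False),
    (0.91, True), (0.88, False), (0.85, False),
]

print(choose_cutoff(reviewed, tolerance=0.20))  # 0.91
print(choose_cutoff(reviewed, tolerance=0.10))  # 0.95
```

A looser tolerance admits more links at the cost of more mismatches, so the tolerance itself is a policy decision rather than something the data can settle.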

In this matching application, we find that most (82.1 percent) New Jersey high school exiters can be matched within 1 to 5 years after high school. Nearly 90 percent of all links are exact, deterministic matches, with little opportunity for mismatches. Together, this high match rate and low mismatch rate make the New Jersey SLDS a promising source for understanding the post-high school education and labor market experiences of young adults in general, as well as for evaluating specific interventions designed to improve those experiences.

We are confident that our efforts to improve government effectiveness in building and learning from this database will be of interest to several research audiences, including policymakers engaged in similar SLDS efforts, academics and graduate students who hope to use datasets like the SLDS to analyze school-to-labor-market transitions, and data scientists involved in creating similar longitudinal administrative records datasets.

Association for Public Policy Analysis & Management
