Panel Paper:
Automated Census Record Linking: A Machine Learning Approach
Thursday, November 2, 2017
Dusable (Hyatt Regency Chicago)
In this paper, I detail a machine learning approach to record linkage for constructing longitudinal historical samples. Newly digitized complete historical census records have brought big data to economic history. Linking individuals over time and between databases has opened up avenues for research into intergenerational mobility, the long-run effects of early life conditions, assimilation, discrimination, and the returns to education. To take advantage of these new research opportunities, scholars need to match historical records accurately and efficiently and to produce an unbiased dataset of links for analysis. The procedure I propose applies insights from machine learning classification and text comparison to record linkage of historical data.
Linking historical data that lack unique identification numbers is difficult and imprecise, because it relies on demographic information like name, age, and place of birth. These variables may be mismeasured because of transcription errors, spelling mistakes, name changes, or name shortening. Manual linking by a trained researcher yields accurate and comprehensive matches, but at the cost of time and replicability. Prior algorithmic approaches have been developed in the historical literature, but their rigid rules are often inefficient, leaving many records unmatched, and inaccurate in the face of messy historical data. My technique uses supervised learning to train an algorithm to replicate the process of manually matching individual records across sources. I am thus able to increase the speed, accuracy, and consistency of creating historical linked samples.
I detail the specifics of the linking method in the paper but give a brief overview here. I begin by cross-matching two census-like datasets and extracting a wide subset of possible matches for each record. I then build a training dataset on a small share of these possible links, manually identifying whether a given record pair is a link or not. These are the data used to train the matching algorithm. For every record pair, I generate a large set of features: string distance in first and last name, difference in year of birth, Soundex indicators, agreement on middle initial, total possible matches for a given record, and so on. When a researcher is making matches manually, each of these features has some weight. For example, differences in last names are penalized more than differences in first names, and string differences early in a name are penalized more than string differences at the end of a name. However, these weights are only implicit, and a researcher might struggle to write them down. The algorithm, observing the features and the match or non-match outcome for each record pair, attempts to minimize both false positives and false negatives and estimates the weights explicitly, learning the rules used by a well-trained and consistent researcher.
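The feature-generation and weight-estimation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the paper: the abstract does not name the string-distance measure or the classifier, so Levenshtein distance and a logistic regression fit by gradient descent are stand-ins. The `Record` fields, the toy records, and their hand-assigned labels are all hypothetical, and the "total possible matches" feature is omitted for brevity.

```python
import math
from typing import NamedTuple

class Record(NamedTuple):  # hypothetical census-like record
    first: str
    middle: str
    last: str
    birth_year: int

# Soundex digit for each consonant; vowels, H, W, and Y carry no digit.
_SOUNDEX = {c: d for d, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                     "4": "L", "5": "MN", "6": "R"}.items()
            for c in letters}

def soundex(name):
    """American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code, last = letters[0], _SOUNDEX.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "HW":  # H and W do not break a run of same-coded letters
            last = digit
    return (code + "000")[:4]

def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def name_distance(a, b):
    """Levenshtein distance scaled to [0, 1] by the longer name's length."""
    return levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)

def pair_features(a, b):
    """Feature vector for one candidate record pair."""
    return [
        name_distance(a.first, b.first),             # first-name string distance
        name_distance(a.last, b.last),               # last-name string distance
        float(abs(a.birth_year - b.birth_year)),     # difference in year of birth
        float(soundex(a.last) == soundex(b.last)),   # Soundex agreement indicator
        float(a.middle == b.middle),                 # middle-initial agreement
    ]

def match_probability(w, x):
    """Logistic score; w[-1] is the intercept, w[:-1] the feature weights."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, y, lr=0.2, epochs=10000):
    """Estimate explicit weights by full-batch gradient descent on log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = match_probability(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Hypothetical hand-labeled training pairs: 1 = link, 0 = not a link.
john = Record("John", "A", "Smith", 1850)
mary = Record("Mary", "E", "Jones", 1860)
TRAINING = [
    (john, Record("John", "A", "Smith", 1850), 1),
    (john, Record("Jon", "A", "Smith", 1851), 1),
    (john, Record("James", "B", "Smyth", 1855), 0),
    (mary, Record("Mary", "E", "Jones", 1861), 1),
    (mary, Record("Martha", "E", "Johnson", 1858), 0),
    (mary, Record("Walter", "", "Green", 1864), 0),
]
X = [pair_features(a, b) for a, b, _ in TRAINING]
y = [label for _, _, label in TRAINING]
weights = fit_logit(X, y)
```

Calling `match_probability(weights, pair_features(a, b))` on an unlabeled candidate pair then gives a score that can be compared against a threshold. Where that threshold sits governs the trade-off between false positives and false negatives, and with a realistic training sample it would be chosen on held-out labeled pairs rather than fixed in advance.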