Panel Paper: Combining Family History and Machine Learning to Link Historical Records

Friday, November 8, 2019
Plaza Building: Concourse Level, Governor's Square 15 (Sheraton Denver Downtown)

*Names in bold indicate Presenter

Joseph Price1, Kasey Buckles2, Isaac Riley1 and Jacob Van Leeuwen1, (1)Brigham Young University, (2)University of Notre Dame


A key challenge for research on many questions in the social sciences is that it is difficult to link administrative records in a way that allows investigators to observe people at different points in their lives or across generations. In this paper, we propose a new approach that can be used in conjunction with other methods to link individuals across United States Census records. We focus specifically on a method for creating large training sets for use with supervised machine learning algorithms. Training data play a key role in supervised machine learning, and the lack of training data has been one of the main barriers to using these methods to link historical records. Unlike previous methods, which rely on a resource-intensive process in which skilled human trainers create the training data, our approach draws on the decisions made by millions of people who are researching their own family histories. The key feature we exploit is that when the profile for a deceased individual on one of these websites has multiple sources attached, each pair of those sources can potentially serve as a labeled training example for making new matches. These profiles therefore provide a relatively low-cost way to create very large training sets with multiple sources of information attached. The training data are also highly reliable, as the family members doing the linking typically have private information that helps them identify the person of interest across multiple data sets.

Our large training set allows us to examine several important decisions that must be made when using a machine learning approach to link historical records: which features are used to restrict the set of candidate matches (blocking), whether to pre-process the data (for example, by standardizing common nicknames), how much training data to use, which machine learning algorithm to use, and how to evaluate the quality of the resulting matches.
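To illustrate the key feature described above, the following is a minimal sketch (not the authors' code) of how positive training pairs can be harvested from profiles with multiple attached sources. The profile and record identifiers are hypothetical; in practice each profile with n attached census records yields n-choose-2 labeled matched pairs.

```python
from itertools import combinations

# Hypothetical data: each family-tree profile maps to the census records
# that users have attached to it. Every pair of sources attached to the
# same profile becomes one positive (matched) training example.
profiles = {
    "profile_A": ["census1900_r12", "census1910_r77", "census1920_r03"],
    "profile_B": ["census1900_r55", "census1910_r41"],
}

positive_pairs = [
    pair
    for sources in profiles.values()
    for pair in combinations(sources, 2)  # n sources -> n*(n-1)/2 pairs
]

print(len(positive_pairs))  # 3 pairs from profile_A + 1 from profile_B = 4
```

Because the labeling is done implicitly by the family historians themselves, the marginal cost of each additional training pair is close to zero, which is what makes very large training sets feasible.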
We use three key measures of success in record linking to guide our decision-making: the match rate (or recall), the false link rate (the complement of precision), and representativeness (how the matched sample compares to the population of interest). Ultimately, we are able to identify over 70% of the potential matches among the three censuses, with high levels of precision and representativeness.
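The first two measures can be made concrete with a small sketch. This is an illustrative computation under the standard definitions (recall = true links found / true links that exist; false link rate = 1 − precision), not the authors' evaluation code, and the record-pair IDs are invented:

```python
def evaluate_links(predicted, truth):
    """Compute the match rate (recall) and false link rate (1 - precision).

    predicted, truth: sets of (record_id_a, record_id_b) pairs,
    where `truth` is a hand-verified set of correct links.
    """
    true_links = predicted & truth
    recall = len(true_links) / len(truth) if truth else 0.0
    false_link_rate = 1 - len(true_links) / len(predicted) if predicted else 0.0
    return recall, false_link_rate

# Hypothetical example: 4 true links exist, the linker proposes 3 pairs,
# of which 2 are correct.
truth = {(1, 101), (2, 102), (3, 103), (4, 104)}
predicted = {(1, 101), (2, 102), (5, 105)}

recall, flr = evaluate_links(predicted, truth)
print(recall)  # 2 of 4 true links found -> 0.5
print(flr)     # 1 of 3 predicted links is false -> 0.333...
```

Representativeness has no single formula; it is typically assessed by comparing the distribution of characteristics (age, race, region, and so on) in the linked sample against the full population of interest.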
