Panel Paper: Creating a Secure Environment to Integrate Data and Build Data Science Skills

Thursday, November 8, 2018
8224 - Lobby Level (Marriott Wardman Park)

*Names in bold indicate Presenter

Julia Lane, New York University


In this presentation, Julia Lane will discuss the origin and goals of the Administrative Data Research Facility (ADRF) and the innovative training approach built around it. She will focus on how this initiative will be incorporated into the TANF Data Innovations (TDI) project.

To harness the potential of administrative data, the core challenge is to professionalize community access to and use of data on human subjects, which have historically been limited, artisanal, and ad hoc. This professionalization involves three key steps. The first is technical: to provide a secure environment within which data providers can place and share their data across agency and jurisdictional lines. The second is operational: to create enough capacity to link disparate data. The third is both legal and practical: to ensure that the data linkage yields value that is both consistent with the agency mission and useful enough to engage decision-makers. While the third is the most important in terms of creating institutional buy-in and will, the first and second are necessary before the third can happen.

The ADRF builds on many years of successful experience designing infrastructure that incorporates these steps. It was commissioned by the US Census Bureau to inform the decision making of the Commission on Evidence-Based Policymaking. In the past year, the ADRF has provided services to almost 180 government agency staff and researchers and hosted almost 50 confidential government datasets from 12 different agencies. The ADRF’s ability to acquire and link these confidential data is evidence that the substantial legal and political hurdles that exist can be surmounted if the technical issues associated with providing a secure environment are addressed and the value proposition is well defined.

For the TANF Data Innovations project, the ADRF will be a platform for intensive data science training. We have developed training classes that create a sandbox environment within which agency staff, not outside vendors, build concrete evidence of the value of linking data. The approach has to be built around agency needs and use modular learning methods. Our classes (i) create a pipeline of new product prototypes central to agency missions, (ii) develop teams of skilled practitioners who have the capacity both to link data and to apply modern analytical approaches to cross-agency problems, and (iii) make a growing set of linked data available as an ongoing asset for budget analysis and program management.

The class itself was structured to train staff in agencies contributing data, along with other state and federal staff, in skills such as managing and linking the relevant data; applying text analysis, network analysis, and machine learning tools; reasoning about inference, privacy, and confidentiality issues; and visualizing the results. We developed Jupyter Notebooks (a new means of combining code with rich text documentation) for each skill set using the linked data so that participants could also use them in team research projects.