Panel Paper: Improved Data for Science Policy Through Patent Data and the STAR Metrics API

Saturday, November 10, 2012 : 1:45 PM
Chesapeake (Sheraton Baltimore City Center Hotel)

Ron Lai, National Science Foundation and Lee Fleming, University of California Berkeley

Relatively complete patent data for the United States became widely available in the mid-1990s. While these data enabled research in the fields of technology and innovation, the publication of a curated patent dataset from the National Bureau of Economic Research (see Hall, Trajtenberg, and Jaffe, 2001) provided much broader access. Unfortunately, that dataset does not track individual inventors over time. Doing so is a non-trivial task because the U.S. Patent Office does not require consistent and unique identifiers for inventors.

This paper describes a new database available to the research community.  Using a Bayesian supervised learning approach, we disambiguated all inventor names from the U.S. utility patent database, from 1975 to the end of 2010.  We provide an overview of the disambiguation methods, assess their accuracy, characterize the resulting dataset, calculate network measures based on co-authorship, and provide illustrative examples.  The dataset is available at the Patent Network Dataverse (
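
As a rough illustration of the two ideas in the paragraph above, the sketch below shows (a) a Bayesian update of the probability that two patent records refer to the same inventor and (b) a simple network measure computed once inventors carry disambiguated identifiers. It is a minimal sketch under stated assumptions, not the method used to build the dataset: the function names, the example likelihood ratios, the inventor identifiers, and the treatment of features as conditionally independent are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import prod


def match_posterior(prior, likelihood_ratios):
    """Posterior probability that two records refer to the same inventor.

    prior: assumed prior probability of a match, strictly between 0 and 1.
    likelihood_ratios: one ratio P(feature | match) / P(feature | non-match)
        per compared feature (e.g. name, location, assignee). Multiplying
        them assumes the features are conditionally independent, which is
        a simplification made for this sketch.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


def coinventor_degree(patents):
    """Number of distinct co-inventors for each disambiguated inventor.

    patents: mapping from a patent number to a list of inventor identifiers
        (hypothetical identifiers, as a disambiguation step might assign).
    Inventors who only ever patent alone have no edges and are omitted.
    """
    neighbors = defaultdict(set)
    for inventor_ids in patents.values():
        # Link every pair of distinct inventors listed on the same patent.
        for a, b in combinations(sorted(set(inventor_ids)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {inventor: len(linked) for inventor, linked in neighbors.items()}


if __name__ == "__main__":
    # Two records with an even prior and evidence favoring a match 3 to 1.
    print(match_posterior(0.5, [3.0]))

    toy_patents = {"P1": ["A", "B"], "P2": ["A", "B", "C"], "P3": ["D"]}
    print(coinventor_degree(toy_patents))
```

In practice the likelihood ratios would be estimated from labeled training pairs, which is what makes the approach supervised. The degree count is only the simplest of the network measures one could compute on the resulting graph.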

The paper also describes a data platform (API) that aims to improve access to federal agency grant data and their associated outcomes, which include patents. The initial approach has been successfully applied to the National Science Foundation and is supplemented with programmatic access through a public-facing Application Programming Interface. We also experimented with algorithmic approaches such as topic modeling to create further internal data linkages and expose data patterns. The platform could substantially lower the barriers to entry for a larger community of researchers and support new data tools built on top of automatically updated data sources.