Challenges in the Data Collection and Analysis of Big Data in the Public Sector

Rethemeyer, R. Karl

The promise of big data resides in the profusion of rich, timely, granular data on behaviors and phenomena that were expensive, and sometimes impossible, to quantify in the past. For commercial enterprises, the exploitation of these data – so long as it stays within the (admittedly weak and incomplete) legal frameworks in place – can often turbocharge the financial bottom line (McKinsey Global Institute, 2011). The value of big data for public operations and public affairs research is a more nuanced and contingent question. Can big data create “public value,” to use Mark Moore’s term? We believe it can, but there are substantial concerns specific to the public affairs context that must be carefully considered.

In this paper we explore a set of issues with public sector use of big data for operations and research. These issues are primarily methodological, but they also stem from the particular intersection of methods, operational context, and sourcing of big data. Our analysis focuses on four areas of concern. First, most big data can be characterized as “digital exhaust” – that is, data generated by commercial and public entities “as a by-product of other activities” (Manyika et al., 2011, p. 1). While cheap and increasingly accessible, digital exhaust is not constructed to the standards expected for academic research or even for high-quality evidence-based management. We will examine the early evidence on the consequences of relying on digital exhaust, drawing particularly on the case of Google Flu Trends (Lazer et al., 2014).

Second, many sources of big data are “public” but were not particularly accessible before the Internet age. As a result, there are substantial concerns regarding public assumptions about privacy and the legitimacy of using public but formerly inaccessible data without some form of enhanced consent. We will briefly consider the case of real estate tract data and how big data is allowing government to create “dossiers” on individuals that may circumvent both constitutional and statutory limits on search and seizure.

Third, while our tools make it increasingly easy to link together disparate data sources, less thought has been given to how easily individuals can be identified from the linked data even when each individual source is de-identified. These concerns cross into thorny issues of research and managerial ethics that will be explored, particularly with reference to studies that demonstrate the ease with which identity may be inferred through linked data sets.
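The linkage concern can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records are invented, the field names (`zone`, `birth_year`, `sex`) are hypothetical quasi-identifiers, and the function `link_records` is not drawn from any study the paper discusses. It shows only the general idea that a de-identified file can be joined to a named public file whenever a combination of shared attributes is unique.

```python
# Hypothetical, made-up records; no real individuals or data sources.
deidentified_health = [
    {"zone": "Z1", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zone": "Z1", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zone": "Z2", "birth_year": 1975, "sex": "F", "diagnosis": "hypertension"},
]
public_roll = [
    {"name": "Resident A", "zone": "Z1", "birth_year": 1975, "sex": "F"},
    {"name": "Resident B", "zone": "Z1", "birth_year": 1982, "sex": "M"},
    {"name": "Resident C", "zone": "Z1", "birth_year": 1982, "sex": "M"},
    {"name": "Resident D", "zone": "Z2", "birth_year": 1975, "sex": "F"},
]

# Assumed quasi-identifiers: attributes present in both files.
QUASI_IDENTIFIERS = ("zone", "birth_year", "sex")


def link_records(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where the key combination matches exactly one person."""
    # Index the named file by its quasi-identifier combination.
    index = {}
    for person in identified:
        combo = tuple(person[k] for k in keys)
        index.setdefault(combo, []).append(person["name"])

    linked = []
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only a unique match re-identifies the record; ties stay ambiguous.
        if len(candidates) == 1:
            linked.append((candidates[0], record))
    return linked


reidentified = link_records(deidentified_health, public_roll)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy data, two of the three "de-identified" records link to a single named resident, while the third remains ambiguous because two residents share the same attribute combination. Coarsening or suppressing quasi-identifiers so that more combinations are shared is one way to reduce such unique matches.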

Finally, because big data often originates from Internet sources, researchers and public managers must carefully consider the potential for bias that is implicit in the demographics of Internet users and Internet content creators.

This paper will conclude with a set of preliminary thoughts on principles that public sector practitioners and public affairs researchers should consider when acquiring, analyzing, and acting upon big data.

Association for Public Policy Analysis & Management

Panel Paper: Challenges in the Data Collection and Analysis of Big Data in the Public Sector