Panel Paper: Mapping Organizations By What They Do: Text-As-Data Applications to Nonprofit Classification

Saturday, November 9, 2019
Plaza Building: Concourse Level, Plaza Court 8 (Sheraton Denver Downtown)


Haohan Chen, University of Pennsylvania and Ruodan Zhang, Indiana University, Bloomington


In the “big data” era, public policy researchers face new opportunities and challenges in extracting useful information from massive, unstructured organizational data. This study shows how text-as-data methods, specifically topic modeling and word embedding, can be applied to organization categorization. We demonstrate the application using public charities’ mission and activities information from IRS Form 990 filings.

Public charities in the U.S. are legally defined under section 501(c)(3) of the tax code as primarily public-serving organizations. To date, there are over 1 million public charities, covering a diverse range of service areas such as arts, health, education, human services, and religion (McKeever, 2018). To capture sub-sectoral variation, the National Center for Charitable Statistics developed the National Taxonomy of Exempt Entities (NTEE) as an ex-ante categorization based on the primary purpose of a nonprofit. Most nonprofit research seeking to statistically control for organizational type relies on the NTEE codes. However, as Fyall, Moore, and Gugerty (2018) point out, the NTEE classification performs poorly in grouping similar nonprofits. Fyall et al. (2018) further suggest using mission statements, coded through dictionary methods, as a more accurate measure. Yet the accuracy of dictionary methods is highly contingent on the quality of the dictionary and on expert judgment, and certain nonprofit subsectors have no ready-made dictionaries to use.
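
To make the dictionary approach concrete, the following is a minimal sketch of keyword-based classification of a mission statement. It is not the procedure of Fyall et al. (2018); the dictionary contents, the category labels, and the function name `classify_by_dictionary` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword dictionary: category label -> set of indicative words.
# A real dictionary would be built and validated by subject-matter experts.
DICTIONARY = {
    "arts": {"museum", "theater", "music", "arts"},
    "health": {"clinic", "health", "hospital", "patients"},
    "education": {"school", "students", "scholarships", "tutoring"},
}


def classify_by_dictionary(mission, dictionary=DICTIONARY):
    """Return the category whose keywords occur most often, or None if no keyword matches."""
    # Lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", mission.lower())
    # Count how many tokens fall in each category's keyword set.
    counts = {
        label: sum(token in words for token in tokens)
        for label, words in dictionary.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(classify_by_dictionary("We provide tutoring and scholarships to local students."))
print(classify_by_dictionary("Protecting wetlands and wildlife habitat."))
```

In this toy setup the second statement matches no category because the dictionary has no environmental keywords, which illustrates how the result depends on dictionary coverage.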

We apply text-as-data algorithms to automatically code and classify organizational type from reported mission statements and major program service activities, using e-filer data from 2009-2016. Instead of developing dictionaries, we use unsupervised machine learning methods, including topic modeling and word embedding (Mikolov et al., 2013), to create a low-dimensional numeric representation of organizational type that allows researchers to match similar organizations. Using topics generated from the Structural Topic Model (Roberts et al., 2014), we create interpretable labels for organizations. The analysis with 100 topics yields a topic distribution and a topic network consistent with our current understanding of the nonprofit landscape, and leads to better model fit than using the NTEE codes alone.
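
As an illustration of how a low-dimensional representation can be used to match similar organizations, the sketch below ranks organizations by cosine similarity of their topic-proportion vectors. The similarity measure is an assumption made for this example, not necessarily the one used in the study, and the organization names, the three-topic vectors, and the helper names are hypothetical.

```python
import math

# Hypothetical topic proportions for four made-up organizations (three topics each).
# In practice these vectors would come from a fitted topic model; the values are illustrative.
topic_proportions = {
    "org_a": [0.70, 0.20, 0.10],
    "org_b": [0.65, 0.25, 0.10],
    "org_c": [0.05, 0.15, 0.80],
    "org_d": [0.10, 0.80, 0.10],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, k=1):
    """Return the k organizations closest to `target`, sorted by descending similarity."""
    scores = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


print(most_similar("org_a", topic_proportions, k=2))
```

With these made-up vectors, `org_b` is the closest match to `org_a` because both place most of their weight on the first topic.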

Additionally, we extend the model by fitting labeled LDA (Latent Dirichlet Allocation; Ramage et al., 2009) with the NTEE codes, and develop a Shiny app that lets researchers conveniently explore the organization types of nonprofits in our data (e.g., maps of nonprofits with different specializations across states). This application offers a new workflow for organization categorization based on large datasets of text documents.

This study presents a versatile categorization method that lets nonprofit researchers better control for organizational type based on what organizations claim they do, and take full advantage of the newly available text information in the IRS Form 990 database. The method can also quickly classify a new organization from text describing its mission and activities. Finally, our analytical process can be applied in general policy research to analyze program and policy text.