Poster Paper: Looking into the Crystal Ball: High School Dropout Prediction Using Statistical Learning Algorithms

Friday, November 4, 2016
Columbia Ballroom (Washington Hilton)

*Names in bold indicate Presenter

Lucy C. Sorensen, Duke University


Raising the high school graduation rate would greatly benefit both students’ long-term welfare and society more broadly. Americans who dropped out of high school earn less on average and are more likely to engage in antisocial behaviors (BLS 2011; Wolfe and Haveman 2002). Two primary policy questions are intertwined here. First, what interventions and policies can effectively prevent dropping out? Second, how can these interventions and policies best target the student populations at highest risk of not graduating? Although the current study does not speak directly to the first policy question, it does provide insight into new methods for accurately identifying, early in their schooling, those students at high risk of leaving high school.

Traditionally, researchers have used analytical tools such as logistic regression to determine which individual factors contribute to the propensity to drop out. My study instead applies machine-learning algorithms to longitudinal administrative data from North Carolina public schools to enhance our understanding of which academic and behavioral factors predict high school graduation and dropout. These methods, which take advantage of flexible, non-parametric, and efficient data mining techniques, predict high school graduation from longitudinal student datasets far more accurately. In concrete terms, whereas earlier attempts to predict graduation or dropout for eighth-grade students classified only 75 percent of students correctly, I correctly predict graduation outcomes for up to 91 percent of students in recent cohorts.

The analyses illustrate how we can use sophisticated statistical learning algorithms to glean formerly “invisible” information about students and to greatly augment our understanding of educational trajectories.  In a broad comparison of methods, I find that support vector machine (SVM) classifiers most accurately predict high school dropout or graduation based on early academic, behavioral, and background indicators, closely followed by classification trees with boosting. Both students’ non-cognitive behaviors and academic successes or failures matter for predicting their ultimate educational trajectories.  Family background and demographic characteristics, on the other hand, appear less salient once I include a large set of other individual traits and behaviors in the model. 
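To illustrate the kind of method comparison described above, the following is a minimal sketch, not the study's actual pipeline. It assumes scikit-learn is available and uses synthetic data from `make_classification` as a stand-in for student records; the feature count, class balance, hyperparameters, and the helper name `compare_models` are all illustrative assumptions rather than details from the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for student records (NOT the North Carolina data):
# 1,000 "students", 20 numeric indicators, and roughly 20 percent of cases
# in the positive ("dropout") class. All of these values are assumptions.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=8,
    weights=[0.8, 0.2],
    random_state=0,
)

# Three classifier families named in the abstract. Hyperparameters are
# library defaults or arbitrary choices, not the study's settings.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM (RBF kernel)": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", C=1.0)
    ),
    "boosted trees": GradientBoostingClassifier(random_state=0),
}


def compare_models(models, X, y, folds=5):
    """Return the mean cross-validated accuracy for each named model."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


results = compare_models(models, X, y)

# Print the models from highest to lowest mean accuracy.
for name, accuracy in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name:22s} {accuracy:.3f}")
```

The ranking on synthetic data says nothing about which method performs best on real student records. The sketch only shows how several classifiers can be scored on equal footing with cross-validation. With an imbalanced outcome such as dropout, metrics other than accuracy (for example, recall on the dropout class) may also be worth reporting.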

This study also explores patterns in educational decision-making during an economic recession, taking advantage of geographical variation in the extent of job loss in North Carolina. I find that although students (particularly males) fared worse in the labor market during the recession, local economic downturns propelled students to graduate high school at a much higher rate than previously observed.

With ever-increasing quantities of micro-level educational data and widespread availability of vast computational resources, this study demonstrates how we can use data science methods to complement other forms of quantitative education policy research. Machine learning algorithms offer great promise both for increasing our understanding of educational processes and for providing powerful prediction systems for schools and policy-makers. If educators and administrators choose to use these methods to reliably identify students at risk of dropping out of high school, they could more effectively provide targeted, intensive programs at the lowest possible cost.