Poster Paper: Can Machine Learning Improve Early Warning Indicators for High School Dropouts?

Saturday, November 10, 2018
Exhibit Hall C - Exhibit Level (Marriott Wardman Park)


Lily Fesler and Thomas Dee, Stanford University


High school graduation is a critical step for students who want access to institutions of higher education and a secure economic future. However, many students drop out of school before receiving their high school diploma, and the problem is more severe among some racial and ethnic minority groups. In San Francisco, 16 percent of Black students and 13 percent of Latino students drop out of high school, compared to an overall dropout rate of 7 percent (Barba, 2016).

School districts across the country have started using early warning systems (EWS) to identify the students at highest risk of dropping out. These systems give districts a few years to reach out to struggling students before they drop out. However, districts often rely on basic prediction rules built from only a couple of simple indicators.

Machine learning techniques have the potential to substantially increase the accuracy of high school dropout predictions (Kleinberg, Ludwig, Mullainathan, & Obermeyer, 2015). These techniques can incorporate more detailed information about students, including academic and behavioral records from each year of schooling. They are also designed to optimize out-of-sample prediction, meaning they are ideal for a problem in which we use data from older students to predict outcomes for younger students.

If districts had access to more accurate student-level high school dropout predictions, they would be able to more effectively target students who would benefit the most from additional supports. This paper investigates how much machine learning techniques can improve these predictions over the methods that are currently in use.

Research Questions

We have two research questions:

  1. How do predictions from machine learning algorithms compare to the early warning algorithms currently in use?
  2. How does the accuracy of predictions improve with each additional year of data?


We use data from the San Francisco Unified School District (SFUSD). SFUSD’s current EWS identifies students as at risk of dropping out if their eighth grade GPA is below 2.0 and their eighth grade attendance rate is below 87.5 percent. We substantially expand the set of variables included in the EWS algorithm. We have data from 2000 to 2016 for students in Pre-K through twelfth grade, covering variables such as background characteristics, English learner (EL) and special education status, number of suspensions and expulsions, grades, and state assessments.
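To make the current two-variable rule concrete, here is a minimal sketch in Python. Only the two thresholds come from the description above. The record layout, the field names, and the choice to store attendance as a fraction are assumptions for illustration, not SFUSD's actual schema or implementation.

```python
from dataclasses import dataclass

# Thresholds as stated in the text; the constant names are illustrative.
GPA_CUTOFF = 2.0
ATTENDANCE_CUTOFF = 0.875  # 87.5 percent, expressed as a fraction


@dataclass
class EighthGradeRecord:
    """Hypothetical record layout; the real data schema is not described."""
    student_id: str
    gpa: float              # eighth grade GPA (4.0 scale assumed)
    attendance_rate: float  # fraction of days attended, 0.0 to 1.0


def is_flagged(record: EighthGradeRecord) -> bool:
    """Flag a student only when BOTH indicators fall below their cutoffs."""
    return (record.gpa < GPA_CUTOFF
            and record.attendance_rate < ATTENDANCE_CUTOFF)


# Made-up students covering each combination of the two indicators.
students = [
    EighthGradeRecord("A", gpa=1.8, attendance_rate=0.80),   # both below
    EighthGradeRecord("B", gpa=1.8, attendance_rate=0.95),   # low GPA only
    EighthGradeRecord("C", gpa=3.2, attendance_rate=0.85),   # low attendance only
    EighthGradeRecord("D", gpa=2.0, attendance_rate=0.875),  # exactly at cutoffs
]
flagged_ids = [s.student_id for s in students if is_flagged(s)]
print(flagged_ids)
```

Because the rule joins the two conditions with "and", a student with a low GPA but good attendance (or the reverse) is not flagged, which is one reason a model using more variables might identify students this rule misses.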

Research Design

We use lasso, ridge, elastic net, random forest, and ensemble methods to predict which students are likely to drop out of high school. We use cross-validation with a loss function to assess the out-of-sample performance of each method and to compare these machine learning methods with the two-variable system currently used in SFUSD. We also estimate how the accuracy of the predictions changes with additional years of data.
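The evaluation design described above can be sketched as a k-fold cross-validation loop that scores each predictor with a common loss function. The sketch below is illustrative only and rests on several assumptions. The data are synthetic. The loss is the Brier score, since the abstract does not name a loss function. A small hand-written ridge-penalized logistic regression stands in for the full set of methods listed. All function names are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def fit_two_variable_rule(X, y):
    """Baseline: training dropout rate among flagged vs. unflagged students."""
    def flagged(row):
        gpa, attendance = row
        return gpa < 2.0 and attendance < 0.875
    base = sum(y) / len(y)
    rates = {}
    for group in (True, False):
        ys = [yi for row, yi in zip(X, y) if flagged(row) == group]
        rates[group] = sum(ys) / len(ys) if ys else base
    return lambda row: rates[flagged(row)]


def fit_ridge_logistic(X, y, penalty=0.01, lr=0.1, steps=300):
    """Ridge-penalized logistic regression fit by plain gradient descent."""
    n, d = len(X), len(X[0])
    # Standardize features using training-set statistics only.
    means = [sum(row[j] for row in X) / n for j in range(d)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
           for j in range(d)]

    def scale(row):
        return [(row[j] - means[j]) / sds[j] for j in range(d)]

    Z = [scale(row) for row in X]
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for z, yi in zip(Z, y):
            err = sigmoid(b + sum(wj * zj for wj, zj in zip(w, z))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * z[j]
        # The penalty shrinks the weights toward zero; the intercept is unpenalized.
        w = [w[j] - lr * (grad_w[j] / n + penalty * w[j]) for j in range(d)]
        b -= lr * grad_b / n
    return lambda row: sigmoid(b + sum(wj * zj for wj, zj in zip(w, scale(row))))


def cross_validated_loss(fit, X, y, k=5):
    """Average held-out Brier score across k folds."""
    losses = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        probs = [predict(X[i]) for i in held_out]
        losses.append(brier_score(probs, [y[i] for i in held_out]))
    return sum(losses) / len(losses)


# Synthetic students: (GPA, attendance rate) and a simulated dropout outcome.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    gpa = rng.uniform(0.5, 4.0)
    attendance = rng.uniform(0.70, 1.0)
    risk = sigmoid(2.0 - 1.5 * gpa - 8.0 * (attendance - 0.85))
    X.append((gpa, attendance))
    y.append(1 if rng.random() < risk else 0)

rule_loss = cross_validated_loss(fit_two_variable_rule, X, y)
ridge_loss = cross_validated_loss(fit_ridge_logistic, X, y)
print(f"two-variable rule: {rule_loss:.3f}  ridge logistic: {ridge_loss:.3f}")
```

Each model is fit only on the training folds and scored only on the held-out fold, so the comparison reflects out-of-sample performance. The numbers this sketch prints describe the simulated data and say nothing about the actual SFUSD results.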


This paper investigates the prediction gains that machine learning techniques make over school districts’ current early warning systems. These gains may be of interest to data-driven school districts that want to target struggling students earlier in their academic careers.