Poster Paper: Machine Learning Algorithms for Longitudinal Clustering

Saturday, November 10, 2018
Exhibit Hall C - Exhibit Level (Marriott Wardman Park)


Alberto Guzman-Alvarez and Lindsay C. Page, University of Pittsburgh


Machine learning (ML) applied to big data in education is still in its infancy but has already borne fruit, especially through unsupervised ML algorithms for tasks such as cluster analysis. Most early research has focused on clustering high-dimensional cross-sectional data, with comparatively little attention to clustering techniques for longitudinal data sets. In longitudinal research, an important question concerns the estimation of homogeneous trajectories (Genolini, Alacoque, Sentenac & Arnaud, 2015). A standard way to analyze variable trajectories is to cluster individuals into distinct groups with homogeneous characteristics. One advantage of this data reduction technique is that it reduces several correlated continuous variables to a single categorical variable. This study applies ML clustering algorithms to a behavioral nudge intervention, with the purpose of using the resulting clusters to estimate heterogeneous treatment effects.

Data for this analysis came from a randomized controlled trial (RCT) that used an automated and personalized text message intervention to remind college-going students of required college enrollment tasks and to connect them with counselor-based support via text message. These types of low-cost behavioral nudges are increasingly popular in policy research. From this application, we examine data on students’ engagement with the intervention across several time points. The study included 20 time points and more than 20,000 participating students. Student engagement with the outreach varied. For example, we observe variation in the intensity of engagement, measured by the number of messages sent to counselors at each time point, and in the depth of engagement, measured by message character count over time.
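As an illustration of the two engagement measures described above, the following is a minimal sketch of how per-student trajectories might be assembled from a message log. The log layout (student ID, time-point index, message text), the function name `build_trajectories`, and the toy records are all hypothetical; the abstract does not describe the actual data format.

```python
from collections import defaultdict


def build_trajectories(messages, n_time_points):
    """Aggregate a message log into two trajectories per student.

    `messages` is an iterable of (student_id, time_point, text) tuples,
    a hypothetical layout assumed for this sketch. Returns two dicts that
    map student_id to a list of length `n_time_points`:
    message counts and total character counts per time point.
    """
    counts = defaultdict(lambda: [0] * n_time_points)
    chars = defaultdict(lambda: [0] * n_time_points)
    for student_id, time_point, text in messages:
        if not 0 <= time_point < n_time_points:
            raise ValueError(f"time point {time_point} out of range")
        counts[student_id][time_point] += 1          # intensity: messages sent
        chars[student_id][time_point] += len(text)   # depth: characters written
    return dict(counts), dict(chars)


# Hypothetical toy log with three time points (the study itself had 20).
toy_log = [
    ("s1", 0, "Hi, what is FAFSA?"),
    ("s1", 0, "Thanks!"),
    ("s1", 2, "Done"),
    ("s2", 1, "ok"),
]
counts, chars = build_trajectories(toy_log, n_time_points=3)
```

Students who never send a message do not appear in the log, so in practice they would need to be added with all-zero trajectories from the study roster.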

We use longitudinal clustering with various constructed student engagement trajectories as model inputs. With these clustering algorithms, we classify students into engagement groups that reveal the typical patterns of interaction with the intervention. Preliminary analysis shows that variation in cluster assignment is driven by variation in behavior during students’ senior year of high school (in contrast to other periods of the intervention, such as the spring of students’ junior year of high school). Each cluster corresponds to a different level of engagement. These clustering methods help to illuminate patterns of student behavior within an intervention and to identify the time periods over which student behavior was most differentiated. This work is a precursor to estimating treatment effects for various outcomes using the constructed trajectory clusters.
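The abstract does not name the specific clustering algorithms used. One common approach to longitudinal clustering is to treat each trajectory as a vector and apply k-means with Euclidean distance; the sketch below assumes that approach purely for illustration. The function names, the deterministic farthest-point initialization (chosen here for reproducibility), and the toy trajectories are assumptions, not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_trajectories(trajs, k, n_iter=100):
    """Cluster trajectories with plain k-means (Lloyd's algorithm).

    Initialization is deterministic: the first center is the first
    trajectory, and each further center is the trajectory farthest
    from the centers chosen so far.
    """
    centers = [list(trajs[0])]
    while len(centers) < k:
        farthest = max(trajs, key=lambda t: min(sq_dist(t, c) for c in centers))
        centers.append(list(farthest))

    labels = None
    for _ in range(n_iter):
        # Assignment step: each trajectory joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(t, centers[j])) for t in trajs
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the pointwise mean of its members.
        for j in range(k):
            members = [t for t, lab in zip(trajs, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical message-count trajectories over four time points.
low_engagement = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
high_engagement = [[5, 6, 7, 8], [6, 5, 8, 9], [4, 6, 7, 7]]
labels, centers = kmeans_trajectories(low_engagement + high_engagement, k=2)
```

The resulting `labels` list is the single categorical variable that summarizes each student's trajectory, and each entry in `centers` is the mean trajectory of one engagement group. Comparing the centers time point by time point is one way to see where the groups diverge most.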