Panel Paper:
Innovative Methods: Classifying Lung Cancer Stage from Health Care Claims – Comparison of a Clinical Algorithm and a Machine-Learning Approach
Study Design: We conducted an observational study using Surveillance, Epidemiology, and End Results (SEER) cancer registry data linked with fee-for-service Medicare data. For patients with newly diagnosed lung cancer, we estimated cancer stage group from Medicare Part A and B claims for services received in the three months before and after the first chemotherapy treatment, classifying each patient as either stage 1-3 or stage 4. SEER stage data served as the gold standard for cancer stage. The first classification method was a clinically derived algorithm that assigned stage group based on the pattern of treatment received (surgery, radiation, and chemotherapy). The second method was an ensemble of machine learning algorithms applied to an expanded set of claims-derived variables that included treatments received, inpatient and outpatient visits, diagnosis codes, and demographic variables. To generate relatively parsimonious algorithms with greater potential for practical use, we used the LASSO (Least Absolute Shrinkage and Selection Operator) to select variable sets within each cross-validation fold. We investigated six thresholds for the maximum number of variables selected: 10, 15, 20, 30, 40, and 50. Classification methods were evaluated and compared on sensitivity, specificity, and accuracy, with stage 1-3 as the reference group (that is, sensitivity is the proportion of stage 1-3 patients correctly classified).
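The evaluation metrics described above can be illustrated with a minimal sketch. It assumes each patient has a gold-standard stage group and a claims-based predicted stage group, and it treats stage 1-3 as the reference class. The function name `stage_metrics`, the label strings, and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence

# Hypothetical labels; stage 1-3 is the reference class, as in the abstract.
EARLY = "stage 1-3"
LATE = "stage 4"


def stage_metrics(truth: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy with stage 1-3 as the reference class."""
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("truth and predicted must be non-empty and the same length")

    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t == EARLY:
            if p == EARLY:
                tp += 1  # stage 1-3 patient classified as stage 1-3
            else:
                fn += 1  # stage 1-3 patient classified as stage 4
        else:
            if p == EARLY:
                fp += 1  # stage 4 patient classified as stage 1-3
            else:
                tn += 1  # stage 4 patient classified as stage 4

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both stage groups must appear in the gold standard")

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(truth),
    }


# Toy illustration only (not study data): 4 early-stage and 6 late-stage patients.
toy_truth = [EARLY] * 4 + [LATE] * 6
toy_pred = [EARLY, EARLY, EARLY, LATE, LATE, LATE, LATE, LATE, LATE, EARLY]
toy_result = stage_metrics(toy_truth, toy_pred)
```

Because the stage 1-3 group is the reference, a low sensitivity means that many early-stage patients are being misclassified as stage 4.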
Population Studied: Adults diagnosed with lung cancer in 2011 or 2012 who were enrolled in fee-for-service Medicare and who received chemotherapy within six months of lung cancer diagnosis.
Results: The study sample included 14,743 lung cancer patients with a mean age of 72.1 years (standard deviation=7.7 years), of whom 54.6% were male. The mean annual household income of the zip code of residence was $58,600. For the clinical algorithm, sensitivity and specificity for identifying early-stage disease were 53% (95%CI=52%-54%) and 89% (95%CI=88%-90%), respectively, and accuracy was 71% (95%CI=71%-72%). The top-performing classifier from the machine learning ensemble was the random forest algorithm. Sensitivity, specificity, and accuracy for the random forest algorithm with 15 variables were 91% (95%CI=90%-92%), 89% (95%CI=88%-90%), and 90% (95%CI=90%-91%), respectively. Key variables for the random forest algorithm included the number and rate of secondary malignancy codes, treatments received (including type of surgery, number of radiation fractions, and specific chemotherapy agents), presence of chronic obstructive pulmonary disease, and region of residence.
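The abstract does not state how the confidence intervals were computed. As one possible approach, the sketch below computes a Wilson score interval for a proportion such as sensitivity or accuracy. The choice of the Wilson method, the function name `wilson_interval`, and the example counts are all assumptions for illustration, not the authors' procedure or data.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, total: int, level: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (an assumed method)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need 0 <= successes <= total and total > 0")
    if not 0 < level < 1:
        raise ValueError("level must be between 0 and 1")

    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / total

    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical counts (not from the study): 900 correct classifications out of 1,000.
lower, upper = wilson_interval(900, 1000)
```

With a sample in the tens of thousands, as in this study, intervals of this kind are narrow, which is consistent with the one- to two-percentage-point widths reported.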
Conclusion: Compared with a clinically derived algorithm, a machine learning classifier demonstrates similar specificity and substantially improved sensitivity and accuracy.
Implications for Policy, Practice, or Delivery: Improved accuracy of stage classification could serve an important role in building a learning health care information system, providing necessary structure for clinically relevant, real-world analyses of cancer care delivery processes, quality measures, and clinical outcomes.
Funding: Section 3021, Affordable Care Act