Panel Paper: Do Common Approaches to Combining Multiple Performance Measures Undermine Districts' Personnel Evaluation Systems?

Friday, November 8, 2013: 8:40 AM
Washington Ballroom (Westin Georgetown)


Michael Hansen, Mariann Lemke and Nicholas Sorensen, American Institutes for Research
Teacher and principal evaluation systems now emerging in response to federal, state, and/or local policy initiatives typically require that a component of teacher evaluation be based on multiple performance measures, which must be combined to produce summative ratings of teacher effectiveness. The process of combining these metrics can itself influence the utility of the evaluation system as a whole. Early-reforming states and districts have used three common approaches to combine multiple performance measures in their evaluation systems, all three of which introduce additional prediction error and, in some cases, bias that was not present in the original measures. This paper investigates whether the error and bias introduced by these approaches erode the ability of evaluation systems to reliably identify high- and low-performing teachers.
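
For concreteness, a numeric composite of this kind typically takes a weighted average of standardized component scores and maps the result to a summative rating category. The sketch below is only illustrative; the component names, weights, and cut points are placeholder assumptions, not the specification studied in the paper.

```python
import numpy as np

# Illustrative component weights (assumed, not the paper's specification)
weights = {"value_added": 0.5, "observation": 0.3, "student_survey": 0.2}

# Illustrative cut points on the standardized composite scale (assumed)
cut_points = [(-0.5, "low"), (0.5, "middle"), (np.inf, "high")]

def numeric_composite(scores):
    """Combine standardized component scores into a weighted composite
    and map it to a summative rating category."""
    composite = sum(weights[k] * scores[k] for k in weights)
    for threshold, label in cut_points:
        if composite <= threshold:
            return composite, label

# Example: a teacher with mixed component scores
print(numeric_composite({"value_added": 0.8, "observation": -0.2, "student_survey": 0.4}))
```

Because each component score carries measurement error, the composite, and any rating cut into categories from it, inherits that error in a way that depends on the weights and cut points chosen.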

Using simulated data based on estimated inter-correlations and reliabilities of measures from the Gates Foundation’s Measures of Effective Teaching project, this analysis compares correct classification rates and expected differences in long-term teacher value-added among teachers identified as high- or low-performing under the three commonly used approaches. We additionally investigate how changes in component weights and the use of reliability-adjusted performance measures affect the identification of high and low performers. Based on the results of the simulation exercise presented here, we conclude that the choice among these approaches matters and can undermine the evaluation system’s objectives in some contexts. Specifically, the numeric approach is preferred among the three and, in several circumstances, is not statistically distinguishable from the best-case error-minimizing approach, which cannot be implemented in practice. In some circumstances, namely when component weights are misaligned with the optimal weighting structure or when reliability-adjusted measures are used, one or both of the remaining approaches perform significantly worse than the numeric approach and can render the evaluation system useless.
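
As a rough illustration of the simulation logic only, not the paper's actual parameters, one can draw latent teacher effectiveness, generate noisy component measures with assumed reliabilities, combine them with assumed weights, and check how often teachers flagged as low-performing on the composite are truly in the bottom of the latent distribution. The reliabilities, weights, and flagging threshold below are placeholders, and the sketch simplifies by letting the measures correlate only through the common latent factor rather than matching the MET inter-correlation structure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 10_000
reliability = {"va": 0.4, "obs": 0.6, "survey": 0.5}  # assumed reliabilities
weights = {"va": 0.5, "obs": 0.3, "survey": 0.2}      # assumed component weights

# Latent long-term effectiveness (standardized)
true_effect = rng.normal(size=n_teachers)

# Each observed measure = true signal plus noise, scaled so its reliability equals r
measures = {
    k: np.sqrt(r) * true_effect + np.sqrt(1 - r) * rng.normal(size=n_teachers)
    for k, r in reliability.items()
}

# Numeric composite: weighted average of the observed measures
composite = sum(weights[k] * measures[k] for k in weights)

# Flag the bottom 10% on the composite; check how many are truly in the bottom 10%
flagged = composite <= np.quantile(composite, 0.10)
truly_low = true_effect <= np.quantile(true_effect, 0.10)
correct_rate = (flagged & truly_low).sum() / flagged.sum()
print(f"Correct classification rate among flagged teachers: {correct_rate:.2f}")
```

Repeating this exercise under different combination rules, weighting schemes, and reliability adjustments gives the kind of comparison of correct classification rates described above.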