Observational Evaluation of Teachers: Measuring More Than We Bargained for?

Campbell, Shanyce

Most states have comprehensive teacher evaluation systems that include classroom observation ratings as a key component. Some research suggests that observational ratings are valid and reliable measures of teacher quality, but new research raises concerns that they may be inequitable. Specifically, new evidence indicates that observation ratings may vary with the characteristics of teachers and the students they teach, apart from the quality of teaching being observed (Steinberg & Garrett, 2016; Campbell, 2014; Whitehurst et al., 2014). Prior research in this area suggests these trends likely reflect inequities in existing teacher evaluation systems. Though prior studies have attempted to separate inequities in observational ratings from actual differences in teacher quality, efforts have been constrained by data and other limitations.

Using secondary data from the Measures of Effective Teaching (MET) project, we make progress in disentangling differences in teacher quality from inequities in observational ratings. Funded by the Bill & Melinda Gates Foundation, the MET project collected teacher and student administrative records data, survey data, and observational data during the 2009-10 and 2010-11 academic years. We examined math and English language arts teachers in grades 4-9 who worked in five large urban school districts in the United States.

This study extends prior literature by employing several methodological approaches to disentangle differences in teacher quality from inequities in observational ratings. We use teacher-by-year fixed effects to test whether a teacher receives worse ratings in classrooms with more marginalized students than the same teacher receives, in the same year, in classrooms with fewer marginalized students. In alternative models, we also include classroom-level value-added model (VAM) scores to examine whether ratings continue to vary by student demographics even after controlling for these measures of classroom-specific teaching quality. Since teacher and student socio-demographic characteristics are often related, we also test whether previously observed relationships between observational ratings and teacher characteristics are explained by student characteristics. In separate analyses, we focus on the subset of teachers who were randomly assigned to classrooms to investigate the role of nonrandom sorting of teachers to students in explaining the relationships we observe between observational evaluations and classroom and teacher characteristics.
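The logic of the teacher-by-year fixed effects comparison can be illustrated with a minimal sketch. It is not the authors' specification: the data are synthetic, and the names `within_group_slope`, `teacher_year`, `share`, and `rating` are hypothetical. The sketch demeans ratings and classroom composition within each teacher-year, so the estimated slope reflects only differences across classrooms taught by the same teacher in the same year.

```python
import numpy as np

def within_group_slope(y, x, groups):
    """Slope of y on x after demeaning both within each group.

    Demeaning within groups is equivalent to including a fixed effect
    (a separate intercept) for every group in an OLS regression.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    y_dm = np.empty_like(y)
    x_dm = np.empty_like(x)
    for g in np.unique(groups):
        mask = groups == g
        y_dm[mask] = y[mask] - y[mask].mean()
        x_dm[mask] = x[mask] - x[mask].mean()
    return float(x_dm @ y_dm / (x_dm @ x_dm))

# Synthetic example (assumption): three teacher-years, two classrooms each.
teacher_year = np.array(["A-2010", "A-2010", "B-2010", "B-2010", "C-2011", "C-2011"])
# Hypothetical share of marginalized students in each classroom.
share = np.array([0.2, 0.6, 0.1, 0.5, 0.7, 0.9])
# Each teacher-year has its own baseline rating (the fixed effect).
baseline = np.array([3.0, 3.0, 2.0, 2.0, 2.5, 2.5])
# Ratings built so that a higher share lowers the rating by 0.5 per unit.
rating = baseline - 0.5 * share

slope = within_group_slope(rating, share, teacher_year)
print(f"within teacher-year slope: {slope:.3f}")
```

Because the baseline is constant within each teacher-year, it drops out after demeaning, and the slope is identified only from classroom-to-classroom variation for the same teacher in the same year.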

Our findings contribute to growing evidence that these ratings seem to measure factors outside of a teacher’s performance or control, including the gender of the teacher and the student population assigned to the teacher. Specifically, the results show that men receive lower ratings, on average, than women. Though prior evidence suggests Black teachers receive lower ratings than White teachers, we demonstrate that this is largely explained by differences in classroom composition. Moreover, we provide the strongest evidence to date that teachers in classrooms with high concentrations of Black, Hispanic, male, and low-performing students receive significantly lower observation ratings and that these differences are unlikely to be due to actual differences in teacher quality or teacher-student sorting. Consistent with Whitehurst et al. (2014), the main policy implication of our study is that districts and states should consider ways to account or adjust for classroom characteristics when using observational rubrics to evaluate teachers.

Association for Public Policy Analysis & Management
