Panel Paper: Using Validity Criteria to Enable Model Selection: An Exploratory Analysis

Thursday, November 7, 2013 : 9:45 AM
3015 Madison (Washington Marriott)


Mark Chin1, Heather Hill1, Daniel McGinn1, Douglas Staiger2 and Kate Buckley1, (1)Harvard University, (2)Dartmouth College
Despite value-added scores’ importance in new teacher evaluation systems, there is little consensus around how to specify these models (Tekwe et al., 2004; Goldhaber & Theobald, 2012). In a recent review, Goldhaber and Theobald (2012) noted that the organizations supplying value-added scores to districts use a range of models: from simple models controlling only for prior test scores, to models that control for student demographics at the individual or aggregate level, to models that compare teachers only within schools via school fixed effects. Different models often produce different rankings of teachers within a district and thus may affect key decisions such as rewards and job security.

Although model choice matters, academics have offered districts little advice about which model to employ. In this paper, we propose that an important determinant of model choice should be alignment with alternative indicators of teacher and teaching quality. Such alignment makes sense from a theoretical perspective, because stronger alignment is thought to indicate a more valid system. There are also practical reasons to let alignment govern model choice: successful efforts to improve scores on educational inputs (e.g., teaching or teacher quality) should then lead to improved scores on student outcomes.

To provide initial evidence on this issue, we first calculated value-added scores for all fourth- and fifth-grade teachers in four districts, then extracted scores for 160 intensively studied teachers. The models used were:

  • A baseline model controlling only for student prior test scores and student demographic indicators;
  • A model that adds classroom-aggregated peer effects to the baseline model;
  • A model that includes prior scores, student demographic indicators, and school fixed effects;
  • A student growth percentile model (Betebenner, 2009).

Each model will be estimated using both one and three years of student test score data.
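To make these specifications concrete, the sketch below shows one simplified way to estimate the three regression-based models in Python. All column names are hypothetical, and averaging residuals by teacher stands in for the paper's actual estimation procedure, which may involve shrinkage or other adjustments; the quantile-based student growth percentile model is omitted.

```python
# Illustrative sketch only, not the paper's actual procedure: value-added is
# approximated as the teacher-mean residual from a student-level OLS model.
# All columns in students.csv are hypothetical: score, prior_score, frl, ell,
# sped, teacher_id, class_id, school_id.
import pandas as pd
import statsmodels.formula.api as smf

def teacher_value_added(df, formula):
    """Fit a student-level model, then average residuals by teacher."""
    fit = smf.ols(formula, data=df).fit()
    return fit.resid.groupby(df["teacher_id"]).mean()

df = pd.read_csv("students.csv")  # hypothetical input file

# Classroom-aggregated peer controls: class means of prior score and demographics.
peer_means = df.groupby("class_id")[["prior_score", "frl", "ell", "sped"]].transform("mean")
df[["peer_prior", "peer_frl", "peer_ell", "peer_sped"]] = peer_means.to_numpy()

controls = "prior_score + frl + ell + sped"
baseline  = teacher_value_added(df, f"score ~ {controls}")
peer_fx   = teacher_value_added(df, f"score ~ {controls} + peer_prior + peer_frl + peer_ell + peer_sped")
school_fe = teacher_value_added(df, f"score ~ {controls} + C(school_id)")
# The student growth percentile model (Betebenner, 2009) is quantile-based
# and not shown here.
```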

We scored the 160 teachers on alternative indicators that prior work has suggested are related to student outcomes:

  • A test of mathematical knowledge for teaching (Hill, Rowan & Ball, 2005);
  • A teacher efficacy indicator (Tschannen-Moran et al., 1998);
  • The Mathematical Quality of Instruction and CLASS observational instruments (Hill, Kapitula & Umland, 2011; Bell et al., 2012);
  • TRIPOD, a survey of students regarding the nature of classroom work (Kane & Staiger, 2012).

We will explore methods for combining these indicators into a single score, enabling us to examine average correlations between each value-added model's scores and the alternative-indicator composite. As a second metric, we will use the percent of variance in value-added scores explained by the individual alternative indicators.
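As one illustration of both metrics, the sketch below builds an equal-weight z-score composite and computes the two alignment statistics for a single model. The indicator column names and the equal weighting are assumptions, since the combination method is still being explored.

```python
# Sketch of both alignment metrics for one value-added model. Assumed,
# hypothetical teacher-level columns: mkt, efficacy, mqi, class_obs, tripod
# (the alternative indicators) and vam_baseline (a value-added score).
import pandas as pd
import statsmodels.formula.api as smf

teachers = pd.read_csv("teachers.csv")  # hypothetical file of the 160 teachers
indicators = ["mkt", "efficacy", "mqi", "class_obs", "tripod"]

# Metric 1: correlation between the VAM score and an equal-weight z-score composite.
z_scores = (teachers[indicators] - teachers[indicators].mean()) / teachers[indicators].std()
teachers["composite"] = z_scores.mean(axis=1)
r = teachers["vam_baseline"].corr(teachers["composite"])

# Metric 2: percent of VAM-score variance explained by the individual indicators.
r_squared = smf.ols("vam_baseline ~ " + " + ".join(indicators), data=teachers).fit().rsquared

print(f"composite correlation r = {r:.2f}; variance explained = {r_squared:.1%}")
```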

Initial analyses using a subset of alternative indicators suggest that alignment between value-added scores and alternative indicators differs by model, though not significantly. Scores from the single-year baseline model, which accounts only for student prior achievement and background characteristics, aligned most closely with the composite (r(158) = .26, p < .01), compared with scores from single-year models with peer effects (r(158) = .21, p < .01) and single-year models with school fixed effects (r(158) = .23, p < .01). Models using three years of student data demonstrated more stable correlations with the composite indicator (r(158) = .22, p < .01 in all cases).
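The abstract does not specify how the significance of these model-to-model differences was assessed. Because each correlation shares the same composite, one illustrative option (an assumption, not the paper's method) is a paired bootstrap of the correlation difference, sketched below with hypothetical arrays.

```python
# Paired bootstrap for the difference between two dependent correlations
# (each VAM score correlated with the same composite). vam_a, vam_b, and
# composite are hypothetical NumPy arrays of length 160, one entry per teacher.
import numpy as np

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def bootstrap_corr_diff(vam_a, vam_b, composite, reps=5000, seed=0):
    """95% CI for corr(vam_a, composite) - corr(vam_b, composite)."""
    rng = np.random.default_rng(seed)
    n = len(composite)
    diffs = np.empty(reps)
    for i in range(reps):
        idx = rng.integers(0, n, size=n)  # resample teachers with replacement
        diffs[i] = corr(vam_a[idx], composite[idx]) - corr(vam_b[idx], composite[idx])
    return np.percentile(diffs, [2.5, 97.5])  # CI containing 0 => difference not significant
```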