Poster Paper: Test Score Manipulation or Statistical Artifact: A Re-Examination of Grading Manipulation on the New York Regent Exam

Thursday, November 8, 2018
Exhibit Hall C - Exhibit Level (Marriott Wardman Park)

Sophie L Litschwartz, Harvard University


Many high school students, in the United States and internationally, are required to pass high-stakes exams to receive their high school diploma. New York State has required such exams since 1878, making its program the longest-running high school exit exam program in the United States. These exams, called the New York State Regent Exams, have traditionally been graded by students’ own teachers, and policy required exams scoring just below the passing cutoff to be automatically re-graded. In 2011, reporting by the Wall Street Journal and research by Dee et al. (2011) showed large discontinuities in the New York City Regent test score distribution around the passing cutoff. Public concern that teachers were unfairly manipulating scores to be just above the passing cutoff led the New York State Regent Board to eliminate exam re-grading and eventually move all exams to central grading.

In this poster I re-examine the original analysis attributing the full Regent test score discontinuity to intentional teacher action. Test scores are measured and graded with error. Under a classical test theory framework, a student’s test score can be broken into three components:

X = T + E_test + E_grade

with X as the observed score, T as the student’s unobserved true score, E_test as the error in the testing, and E_grade as the error in the grading. Under this framework, the policy of selectively re-grading student exams leads to distribution discontinuities without any intentional teacher action. Students with low true scores who, through random chance, end up to the right of the passing threshold stay there, because their exams are not re-graded. On the other hand, students with high true scores who randomly end up to the left of the threshold are likely to move to the right of the threshold in the final re-grade. This process naturally leads to excess density just to the right of the passing threshold.
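The mechanism described above can be illustrated with a minimal simulation sketch. This is not the poster's simulation: the cutoff of 65, the 5-point re-grading band, the normal score distribution, all standard deviations, and the rule that a re-graded score simply replaces the first-pass score are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def simulate_regrade(n=200_000, cutoff=65.0, band=5.0, mean_true=65.0,
                     sd_true=12.0, sd_test=3.0, sd_grade=2.0, seed=0):
    """Return (first_pass, final) scores with selective re-grading only.

    All parameter values are illustrative assumptions, not estimates
    from the exam data.
    """
    rng = np.random.default_rng(seed)
    true = rng.normal(mean_true, sd_true, n)   # T: unobserved true score
    e_test = rng.normal(0.0, sd_test, n)       # testing error, fixed per exam
    # First-pass score: X = T + E_test + E_grade
    first = true + e_test + rng.normal(0.0, sd_grade, n)
    # Only exams that land just below the cutoff are re-graded.
    in_band = (first >= cutoff - band) & (first < cutoff)
    # A re-grade draws a fresh grading error; testing error is unchanged.
    regrade = true + e_test + rng.normal(0.0, sd_grade, n)
    # Assumption: the re-graded score replaces the first-pass score.
    final = np.where(in_band, regrade, first)
    return first, final


def jump_at_cutoff(scores, cutoff=65.0, width=1.0):
    """Share of scores just right of the cutoff minus share just left."""
    right = np.mean((scores >= cutoff) & (scores < cutoff + width))
    left = np.mean((scores >= cutoff - width) & (scores < cutoff))
    return right - left


first, final = simulate_regrade()
print("jump before re-grading:", jump_at_cutoff(first))
print("jump after re-grading: ", jump_at_cutoff(final))
```

In this sketch no grader ever looks at the cutoff when assigning a score, yet the final distribution shows excess mass just above it, because only exams below the cutoff get a second draw of grading error.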

Here I show that relatively small amounts of error can account for large discontinuities. I use scores from the June 2009 Integrated Algebra subject test to simulate the expected distribution under the re-scoring policy but with no intentional teacher manipulation. The observed discontinuity at the cutoff is 5.1% of the total distribution (.07% of students are to the left of the cutoff and .58% are to the right). Assuming a test score grading reliability of between .99 and .5, re-grading accounts for between 15% and 40% of the test score discontinuity. When I assume the test grading reliability is the reported constructed response reliability of .87, I find that the selective re-grading policy accounts for 32% of the observed discontinuity. The re-grading policy, therefore, does not account for the whole discontinuity, but it does account for a substantial portion of it.
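To make the reliability values concrete, the sketch below converts a grading reliability into a grading-error standard deviation. It assumes the standard classical test theory definition, in which reliability is one minus the share of observed score variance due to grading error; the poster may define grading reliability differently. The function name and the observed standard deviation of 10 points are hypothetical.

```python
import math


def grading_error_sd(observed_sd, reliability):
    """Grading-error SD implied by a reliability under classical test theory.

    Assumes reliability = 1 - Var(E_grade) / Var(X), so that
    SD(E_grade) = SD(X) * sqrt(1 - reliability).
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return observed_sd * math.sqrt(1.0 - reliability)


# Hypothetical observed score SD of 10 points, for illustration only.
for rho in (0.99, 0.87, 0.5):
    print(rho, round(grading_error_sd(10.0, rho), 2))
```

Under this definition, lower reliability means a larger grading-error spread, which gives a re-grade more room to move a score across the cutoff.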