Panel Paper: Measuring Test Measurement Error: A General Approach

Saturday, November 10, 2012 : 4:10 PM
Hanover B (Radisson Plaza Lord Baltimore Hotel)

*Names in bold indicate Presenter

Donald J. Boyd, University of Albany-SUNY; Hamilton Lankford, State University of New York, Albany; Susanna Loeb, Stanford University; and James Wyckoff, University of Virginia


Recent developments in education, including increased accountability, efforts to measure teacher effectiveness, and a dramatic increase in research on the effects of policy interventions, rely on achievement tests as an important metric of student skills and knowledge. Yet we know little about some properties of these tests that bear directly on their use and interpretation. For example, are the tests aligned with a particular set of educational standards, or with the outcomes of interest to policymakers and analysts? To what extent does test construction result in ceiling or floor effects? How large is test measurement error, and what are its implications for educational policy and practice?

Rather than analyzing the consistency of student test scores across test occurrences, the standard approach used by test vendors is to divide the test taken at a single point in time into what are hoped to be parallel parts. Reliability measured as the consistency (i.e., correlation) of students’ scores across these parts accounts only for the measurement error that results from randomly selecting a set of test items from the relevant population of items.
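To make the single-occasion approach concrete, the sketch below computes a split-half reliability on simulated item responses. It is only an illustration under stated assumptions: the odd/even item split, the Spearman-Brown step-up to full test length, the simulated data, and the names `pearson` and `split_half_reliability` are all choices made here and are not taken from the paper or from any vendor's procedure.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(item_scores):
    """Split-half reliability from a list of per-student item-score lists.

    Items are split into odd- and even-positioned halves (an assumed way
    of forming "parallel parts"); the half-test correlation is stepped up
    to full test length with the Spearman-Brown formula.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd_half, even_half)
    return 2 * r_half / (1 + r_half)


# Simulated data (hypothetical): 2000 students, 40 right/wrong items,
# all taken on a single occasion.
rng = random.Random(0)
scores = []
for _ in range(2000):
    ability = rng.gauss(0, 1)
    scores.append([1 if ability + rng.gauss(0, 1) > 0 else 0 for _ in range(40)])

reliability = split_half_reliability(scores)
print(round(reliability, 3))
```

Because every simulated item is answered on the same occasion, nothing in this calculation can pick up day-to-day variation in a student's performance; the coefficient reflects only item-sampling error.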

As Feldt and Brennan (1989) note, this approach “frequently present[s] a biased picture” in that “reported reliability coefficients tend to overstate the trustworthiness of educational measurement, and standard errors underestimate within-person variability,” the problem being that measures based on a single test occurrence ignore potentially important day-to-day differences in student performance.

In this paper we show that there is a credible approach for measuring the overall extent of test measurement error that can be applied in a wide variety of settings. Estimation is straightforward and requires only estimates of the correlation or covariance of test scores in the subject of interest at several points in time (e.g., the correlations between third-, fourth- and fifth-grade math scores for one cohort of students). Note that one need not have student-level test score data, provided that one has estimates of test-score correlations or covariances. Our approach generalizes the test-retest framework to allow for either growth or decay in the knowledge, skills and abilities of students between test administrations, as well as variation across tests in the extent of measurement error. Using the estimated test-score covariance or correlation matrix and a few assumptions regarding the structure of student achievement growth, it is possible to estimate the overall extent of test measurement error and to decompose the variance of test scores into the part attributable to real differences in academic achievement and the part attributable to measurement error.
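As a rough illustration of how correlations alone can identify measurement error, the sketch below works through a simple three-wave special case. It is not the estimator developed in the paper. It assumes (as a simplification made here) that true achievement follows a first-order process, so that the grade 3 to grade 5 true-score correlation is the product of the two adjacent ones, and that measurement errors are uncorrelated across grades and with achievement. Under those assumptions the reliability of the middle test equals r12 * r23 / r13. The function name and all numbers are hypothetical.

```python
from math import sqrt


def middle_wave_reliability(r12, r23, r13):
    """Reliability of the middle test from three observed correlations.

    Assumes first-order true-score dynamics and uncorrelated measurement
    errors, in which case the true-score correlations and the outer
    reliabilities cancel out of the ratio r12 * r23 / r13.
    """
    if r13 <= 0:
        raise ValueError("r13 must be positive for the ratio to be meaningful")
    return r12 * r23 / r13


# Build observed correlations from known (hypothetical) parameters, then
# check that the ratio recovers the middle test's reliability.
rel = (0.80, 0.85, 0.90)   # assumed reliabilities in grades 3, 4, 5
c12, c23 = 0.90, 0.95      # assumed true-score correlations, adjacent grades

r12 = sqrt(rel[0] * rel[1]) * c12
r23 = sqrt(rel[1] * rel[2]) * c23
r13 = sqrt(rel[0] * rel[2]) * c12 * c23   # first-order assumption

estimate = middle_wave_reliability(r12, r23, r13)
error_share = 1 - estimate   # share of score variance due to measurement error

print(round(estimate, 4), round(error_share, 4))
```

Only the three correlations enter the calculation, which mirrors the point that student-level data are not required. With three waves, this simple setup pins down only the middle test's reliability; more waves or further structure are needed to say anything about the outer tests.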