Panel Paper: Developments in Methods for Measuring Noncognitive Skills

Thursday, November 8, 2018
Lincoln 3 - Exhibit Level (Marriott Wardman Park)


Patrick Kyllonen, Educational Testing Service

The importance of noncognitive (soft) skills both as a predictor of educational success and as an outcome of education has been increasingly recognized in the U.S. and abroad. The National Academy of Sciences has issued five reports on the importance of noncognitive skills in education, several states have adopted social-emotional learning standards from kindergarten through grade 12, the OECD is conducting a new study of social-emotional skills growth from elementary school into the workforce, and accountability assessments of social-emotional skills are beginning to emerge in the U.S. and internationally (e.g., in Chile). Despite this activity acknowledging the importance of these skills, there is concern about the quality of noncognitive skills measurement. Current measurement relies primarily on self-ratings on Likert scales that are sum-scored, although ratings by teachers (and others) are sometimes administered.
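To make the conventional approach concrete, here is a minimal sketch of sum-scoring Likert self-ratings; the item names and the reverse-keyed set are hypothetical illustrations, not taken from the presentation:

```python
# Illustrative sketch (not from the paper): sum-scoring Likert self-ratings,
# the conventional scoring approach described above. Item names and the
# reverse-keyed set below are hypothetical.

def sum_score(responses, reverse_keyed=(), scale_max=5):
    """Sum-score a set of 1..scale_max Likert responses.

    responses: dict mapping item id -> integer rating
    reverse_keyed: item ids whose rating is flipped before summing
    """
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            rating = scale_max + 1 - rating  # flip negatively worded items
        total += rating
    return total

# Example: a four-item scale with one reverse-keyed ("careless") item
resp = {"organized": 5, "on_time": 4, "careless": 2, "thorough": 4}
print(sum_score(resp, reverse_keyed={"careless"}))  # 5 + 4 + 4 + 4 = 17
```

The simplicity of this scoring is part of why Likert scales dominate in practice, which is the backdrop for the limitations reviewed next.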

In this presentation I review the limitations of Likert scale ratings, namely their susceptibility to (a) response-style effects, the tendency to select the extremes, the middle, or the positive end of the response scale independently of the respondent's underlying standing on the construct being measured; (b) reference-group effects, the dependency of responses on the target of comparison, which varies across respondents and over time; (c) social desirability bias, responding motivated by the desire to present oneself favorably to the assessment administrator; and (d) for ratings by others, halo or horn effects, the tendency of raters to give an overall positive or negative assessment without differentiating among items or dimensions.
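The response-style problem can be shown with a toy simulation; all numbers and the style rules below are illustrative assumptions, not data from the presentation:

```python
# Toy illustration (assumed, not from the paper): respondents with identical
# underlying standing but different response styles receive different Likert
# sum scores, so the scores confound the trait with the style.

def rate(level, style=None):
    """Map a latent level in [0, 1] to a 1..5 rating under a response style."""
    if style == "extreme":
        return 5 if level >= 0.5 else 1   # gravitates to the endpoints
    if style == "midpoint":
        return 3                          # gravitates to the middle
    return round(1 + 4 * level)           # style-free mapping to the scale

standing = [0.7, 0.7, 0.7, 0.7]           # same true level on four items

print(sum(rate(x) for x in standing))              # 16: style-free score
print(sum(rate(x, "extreme") for x in standing))   # 20: inflated
print(sum(rate(x, "midpoint") for x in standing))  # 12: deflated
```

The three sums differ even though the underlying standing is identical, which is exactly the confound the alternatives below are designed to remove.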

I then review research demonstrating the value of several alternatives for noncognitive skills measurement that mitigate some of the disadvantages of Likert scale measurement: (a) forced-choice (or ranking) methods, (b) anchoring vignettes, (c) situational judgment tests, and (d) performance tasks. Item response theory (IRT) methods for scoring forced-choice (and ranking) tests, in which the respondent ranks the items within a block (of two to four items), have been shown to yield better correlations with educational and workforce outcomes than Likert scale measures. Anchoring vignettes for personality and climate constructs have been shown in international large-scale educational assessments to increase cross-group comparability and improve prediction of achievement. The validity evidence for situational judgment tests (SJTs) has been documented in many contexts, but reliability has been a challenge, requiring longer assessment time to achieve reliable measurement; new IRT methods for scoring SJTs reduce that time, making SJTs a more viable assessment instrument. Finally, I discuss one performance-based measure of noncognitive constructs, a collaborative problem-solving task, which assesses collaborative skills from respondent choices and dialogues with collaborators. All of these examples are reviewed with sample tasks illustrating the methods and with data demonstrating their potential advantages over traditional rating-scale measurement. I close with a discussion of the prospects for using these methods in studies, considering what problem each alternative approach is designed to solve, the strength of evidence that it can solve that problem, and the costs of implementing it.
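Of the alternatives above, the anchoring-vignette approach lends itself to a compact sketch: a standard nonparametric recoding re-expresses each self-rating relative to that respondent's own ordering of the vignette ratings, so respondents who anchor the scale differently become comparable. Function and variable names here are mine, and the sketch assumes each respondent orders the vignettes strictly:

```python
# Sketch of a standard nonparametric anchoring-vignette recode (assumption:
# the respondent rates the vignettes in strictly increasing order). The
# self-rating is placed relative to the respondent's OWN vignette ratings,
# which absorbs reference-group and response-style differences.

def vignette_recode(self_rating, vignette_ratings):
    """Return a score in 1..2k+1 for k vignettes: below the lowest anchor
    yields 1; being tied with an anchor adds 1; strictly exceeding an
    anchor adds 2."""
    score = 1
    for anchor in sorted(vignette_ratings):
        if self_rating > anchor:
            score += 2          # strictly above this anchor
        elif self_rating == anchor:
            score += 1          # tied with this anchor
            break
        else:
            break               # below this (and every higher) anchor
    return score

# Two respondents give the same self-rating of 4 but anchor the scale
# differently, so the recoded scores differ:
print(vignette_recode(4, [2, 3]))  # 5: above both of this respondent's anchors
print(vignette_recode(4, [4, 5]))  # 2: tied with this respondent's low anchor
```

The recoded values are only ordinal and handle ties among vignettes less cleanly, which is one reason model-based (IRT) treatments of vignettes also exist; the sketch is meant only to convey the core idea.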