Introduction to Many-Facet Rasch Measurement

Analyzing and Evaluating Rater-Mediated Assessments. 2nd Revised and Updated Edition

by Thomas Eckes (Author)
Monographs 241 Pages
Series: Language Testing and Evaluation, Volume 22

Table of Contents

  • Cover
  • Title
  • Copyright
  • About the author
  • About the book
  • This eBook can be cited
  • Contents
  • Preface to the First Edition
  • Preface to the Second Edition
  • 1. Introduction
  • 1.1 Facets of measurement
  • 1.2 Purpose and plan of the book
  • 2. Rasch Measurement: The Basics
  • 2.1 Elements of Rasch measurement
  • 2.1.1 The dichotomous Rasch model
  • 2.1.2 Polytomous Rasch models
  • 2.2 Rasch modeling of many-facet data
  • 2.2.1 Putting the facets together
  • 2.2.2 The sample data: Essay ratings
  • 2.2.3 Rasch modeling of essay rating data
  • 3. Rater-Mediated Assessment: Meeting the Challenge
  • 3.1 Rater variability
  • 3.2 Interrater reliability
  • 3.2.1 The standard approach
  • 3.2.2 Consensus and consistency
  • 3.2.3 Limitations of the standard approach
  • 3.3 A conceptual–psychometric framework
  • 3.3.1 Proximal and distal facets
  • 3.3.2 A measurement approach
  • 4. Many-Facet Rasch Analysis: A First Look
  • 4.1 Preparing for a many-facet Rasch analysis
  • 4.2 Measures at a glance: The Wright map
  • 4.3 Defining separation statistics
  • 4.4 Applying separation statistics
  • 4.5 Global model fit
  • 5. A Closer Look at the Rater Facet: Telling Fact from Fiction
  • 5.1 Rater measurement results
  • 5.1.1 Estimates of rater severity
  • 5.1.2 Rater fit statistics
  • 5.1.3 Observed and fair rater averages
  • 5.2 Studying central tendency and halo effects
  • 5.2.1 Central tendency
  • 5.2.2 Halo
  • 5.3 Raters as independent experts
  • 5.4 Interrater reliability again: Resolving the paradox
  • 6. Analyzing the Examinee Facet: From Ratings to Fair Scores
  • 6.1 Examinee measurement results
  • 6.2 Examinee fit statistics
  • 6.3 Examinee score adjustment
  • 6.4 Criterion-specific score adjustment
  • 7. Criteria and Scale Categories: Use and Functioning
  • 7.1 Criterion measurement results
  • 7.2 Rating scale structure
  • 7.3 Rating scale quality
  • 8. Advanced Many-Facet Rasch Measurement
  • 8.1 Scoring formats
  • 8.2 Dimensionality
  • 8.3 Partial credit and hybrid models
  • 8.4 Modeling facet interactions
  • 8.4.1 Exploratory interaction analysis
  • 8.4.2 Confirmatory interaction analysis
  • 8.5 Summary of model variants
  • 9. Special Issues
  • 9.1 Rating designs
  • 9.2 Rater feedback
  • 9.3 Standard setting
  • 9.4 Generalizability theory (G-theory)
  • 9.5 MFRM software and extensions
  • 10. Summary and Conclusions
  • 10.1 Major steps and procedures
  • 10.2 MFRM across the disciplines
  • 10.3 Measurement and validation
  • 10.4 MFRM and the study of rater cognition
  • 10.5 Concluding remarks
  • References
  • Author Index
  • Subject Index

Preface to the First Edition

This book grew out of times of doubt and disillusionment, times when I realized that our raters, all experienced professionals specifically trained in rating the performance of examinees on writing and speaking tasks of a high-stakes language test, were unable to reach agreement in the final scores they awarded to examinees. What first seemed to be a sporadic intrusion of inevitable human error soon turned out to follow an undeniable, clear-cut pattern: Interrater agreement and reliability statistics revealed that ratings of the very same performance differed from one another to an extent that was totally unacceptable, considering the consequences for examinees’ study and life plans.

So, what was I to do about it? Studying the relevant literature in the field of language assessment and beyond, I quickly learned two lessons: First, rater variability of the kind observed in the context of our new language test, the TestDaF (Test of German as a Foreign Language), is a notorious problem that has always plagued human ratings. Second, at least part of the problem has a solution, and this solution builds on a Rasch measurement approach.

Having been trained in psychometrics and multivariate statistics, I was drawn to the many-facet Rasch measurement (MFRM) model advanced by Linacre (1989). It appeared to me that this model could provide the answer to the question of how to deal appropriately with the error-proneness of human ratings. Yet, it was not until October 2002, when I attended a workshop on many-facet Rasch measurement conducted by Dr. Linacre in Chicago, that I made up my mind to use this model operationally with the TestDaF writing and speaking sections. Back home in Germany, it took a while to convince those in charge of our testing program of the unique advantages offered by MFRM. But in the end I received broad support for implementing this innovative approach. It has been in place now for a number of years, and it has been working just fine.

In a sense, then, this book covers much of what I have learned about MFRM from using it on a routine basis. Hence, the book is written from an applied perspective: It introduces basic concepts, analytical procedures, and statistical methods needed in constructing proficiency measures based on human ratings of examinee performance. Each book chapter thus serves to corroborate the famous dictum that “there is nothing more practical than a good theory” (Lewin, 1951, p. 169). Though the focus of the MFRM applications presented herein is on language assessment, the basic principles readily generalize to any instance of rater-mediated performance assessment typically found in the broader fields of education, employment, the health sciences, and many others.

The present book emerged from an invited chapter included in the Reference Supplement to the Manual for Relating Language Examinations to the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2009), Section H (Eckes, 2009a). Once more, I would like to thank the members of the Council of Europe’s Manual Authoring Group, Brian North, Sauli Takala (editor of the Reference Supplement), and Norman D. Verhelst, for helpful comments and suggestions on earlier drafts of that chapter. In addition, I received valuable feedback on the chapter from Rüdiger Grotjahn, Klaus D. Kubinger, J. Michael Linacre, and Carol M. Myford. When the chapter had evolved into this introduction, I was lucky enough to receive again feedback on the completely revised and expanded text, or parts of it, from Mike Linacre and Carol Myford. I highly appreciate their support and encouragement during my preoccupation with some of the more intricate and challenging issues of the MFRM approach. Of course, any remaining errors and shortcomings are mine.

I would also like to express my gratitude to my colleagues at the TestDaF Institute, Bochum, Germany, for many stimulating discussions concerning the design, analysis, and evaluation of writing and speaking performance assessments. Special thanks go to Achim Althaus, Director of the TestDaF Institute, who greatly supported me in striking a new path for designing a high-quality system of performance ratings. The editors of the series Language Testing and Evaluation, Rüdiger Grotjahn and Günther Sigott, warmly welcomed my book proposal. Sarah Kunert and Miriam Matenia, research assistants at the TestDaF Institute, helped with preparing the author and subject indexes.

Last, but not least, I would like to thank those persons close to me. My wife Andrea encouraged me to get the project started and provided the support to keep going. My children Laura and Miriam shared with me their experiences of rater variability at school (though they would not call it that), grumbling about Math teachers being unreasonably severe and others overly lenient, or about English teachers eagerly counting mistakes and others focusing on the skillful use of idiomatic expressions, to mention just a few examples. Looking back at my own schooldays, it is tempting to conclude that rater variability at school is one of the most reliable things in life. At the same time, this recurring variability pushed my motivation for finishing the book project to ever higher levels.

Indeed, my prime goal of writing this book was to introduce those who in some way or another employ, oversee, or evaluate rater-mediated performance assessments to the functionality and practical utility of many-facet Rasch measurement. To the extent that readers feel stimulated to adopt the MFRM approach in their own professional context, this goal has been achieved. So, finally, these are times of hope and confidence.

Thomas Eckes

March, 2011

Preface to the Second Edition

This second edition of my Introduction to Many-Facet Rasch Measurement is an extensive revision of the earlier book. I have been motivated by the many positive reactions from readers, and by learning that researchers and practitioners across wide-ranging fields of application are more than ever ready to address the perennial problems inherent in rater-mediated assessments building on a many-facet Rasch measurement approach.

In the present edition, I have revised and updated each chapter, expanded most chapters, and added a completely new chapter. Here I provide a brief outline of the major changes: Chapter 2 (“Rasch Measurement: The Basics”) discusses more deeply the fundamental, dichotomous Rasch model, elaborating on key terms such as latent variable, item information, and measurement invariance. Chapter 5 (“A Closer Look at the Rater Facet”) has been reorganized, dealing in a separate section with rater severity estimates and their precision; the section on rater fit statistics now includes a detailed discussion of the sample size issue. Further major amendments concern Chapter 6 (“Analyzing the Examinee Facet”), with new sections on examinee measurement results, examinee fit statistics, and criterion-specific score adjustment, and Chapter 7 (“Criteria and Scale Categories”), with new sections on criterion measurement results, manifest and latent rating scale structures, and indicators of rating scale quality. Chapter 8 (“Advanced Many-Facet Rasch Measurement”) probes more deeply into methods of confirmatory interaction analysis, focusing on approaches using dummy facets. The book now closes with Chapter 10 (“Summary and Conclusions”). In this chapter, I recapitulate relevant steps and procedures to consider when conducting a many-facet Rasch analysis, briefly discuss MFRM studies in a number of different fields of application, reconsider the implications of many-facet Rasch measurement for the validity and fairness of inferences drawn from assessment outcomes, and highlight the use of MFRM models within the context of mixed methods approaches to examining raters’ cognitive and decision-making processes.

I would like to thank my colleagues at the TestDaF Institute, University of Bochum, Germany, for keeping me attuned to the practical implications of many-facet Rasch measurement within a high-stakes assessment context. I am also grateful to research assistants Anastasia Bobukh-Weiß and Katharina Sokolski who diligently updated the author and subject indexes. My special thanks are due to Carol Myford and Mike Linacre who once more took their time to provide me with valuable feedback on this revised edition. As before, the errors that may remain are entirely my own.

The many-facet extension of the basic, dichotomous Rasch model ensures real progress by enhancing the validity and fairness of rater-mediated assessments across a steadily growing number of disciplines. Hopefully, this book will continue to play its part in further disseminating the rationale and practical utility of many-facet Rasch measurement.

Thomas Eckes

May, 2015

1. Introduction

This chapter introduces the basic idea of many-facet Rasch measurement. Three examples of assessment procedures taken from the field of language testing illustrate the broader context of its application. In the first example, examinees respond to items of a reading comprehension test. The second example refers to a writing performance assessment, where raters evaluate the quality of essays. In the third example, raters evaluate the performance of examinees on a speaking assessment involving live interviewers. After a discussion of key concepts such as facets and rater-mediated assessment, the chapter points out the general steps involved in adopting a many-facet Rasch measurement approach. It concludes with an outline of the book’s purpose and a brief overview of the chapters to come.

1.1 Facets of measurement


Publication date: 2011 (June)
Keywords: Language Testing, Rater Effects, Educational Measurement, Performance Assessment, Rasch Model
Published: Frankfurt am Main, Berlin, Bern, Bruxelles, New York, Oxford, Wien, 2015. 241 pp.

Biographical notes

Thomas Eckes (Author)

Thomas Eckes is Head of the Psychometrics and Research Methodology Department, TestDaF Institute, University of Bochum, Germany. His research interests include language testing, multivariate data analysis, large-scale assessments, psychometric modeling of language competencies, and web-based testing.

