Evaluating the Item Descriptor (ID) Matching Method in a Face-to-Face and Synchronous Virtual Environment

by Paraskevi (Voula) Kanistra (Author)
©2026 · Thesis · 432 pages
Open Access
Series: Language Testing and Evaluation, Volume 50

Summary

This book turns the page on standard setting, calling for a time of change. Expanding Cizek & Earnest's (2016) evaluation framework, it delivers a comprehensive mixed-methods investigation of Ferrara & Lewis's (2012) Item Descriptor Matching method, using a multi-phase design across face-to-face and virtual workshops. At its core lies the author's Unified Alignment & Test Development (UATD) framework, which embeds a unique quantitative principled cut score approach to calculate defensible, trustworthy thresholds grounded in theory. Using corpus linguistics and AI models to analyse panel discussions, the book shows how innovative methodologies enhance the validity and robustness of CEFR-linking studies. It also translates the bottom-up and top-down strategies panelists used to tame the CEFR into innovative activities for the familiarisation stage. The result is an expanded, transparent, and forward-looking practice that strengthens the validity, fairness, and impact of standard setting.

Table of Contents

  • Cover
  • Title Page
  • Copyright Page
  • Dedication
  • Table of Contents
  • Figures
  • Tables
  • Abbreviations
  • Preface
  • Acknowledgments
  • Abstract
  • CHAPTER 1 Introduction
  • 1.1 Structure of the book
  • CHAPTER 2 Literature Review
  • 2.1 An overview of standard setting
  • 2.1.1 The Angoff method
  • Advantages and disadvantages of the Angoff method
  • 2.1.2 The Bookmark Standard Setting Procedure
  • The ordered item booklet
  • The panelist task and cut score calculation
  • Advantages and disadvantages of the Bookmark Standard Setting Procedure
  • 2.1.3 The Item Descriptor Matching method
  • Background
  • Description of the ID standard setting task
  • Description of the standard setting process
  • Advantages and disadvantages of the Item Descriptor Matching method
  • 2.1.4 The Embedded Standard Setting method
  • Examinee-centered methods
  • 2.1.5 The Performance Profile method
  • Advantages and disadvantages of the Performance Profile method
  • 2.1.6 The Dominant Profile method
  • Advantages and disadvantages of the Dominant Profile Method
  • 2.1.7 The Body of Work method
  • Advantages and disadvantages of the Body of Work Method
  • 2.2 Aligning examinations to the CEFR
  • 2.3 Issues with the CEFR and with aligning examinations to the CEFR
  • 2.4 Virtual standard setting
  • 2.5 Contextualizing the research study
  • CHAPTER 3 Background to the Study
  • 3.1 Description of the ISE examination
  • 3.1.1 Selection of tasks
  • 3.2 Methodology and process of the Trinity benchmarking study
  • 3.2.1 The standard setting process
  • 3.2.2 The standard setting method
  • 3.2.3 Participants
  • 3.3 Summary of background to the study
  • CHAPTER 4 Methodology
  • 4.1 Aim, design, and research questions
  • Research questions
  • 4.1.1 Study design
  • 4.1.2 Overview of the study
  • 4.1.3 Conducting the virtual workshop
  • 4.1.4 Background to focus group interviews
  • 4.1.5 Focus group interviews
  • 4.2 Framework for evaluating standard setting workshops
  • 4.3 Materials and data collection instruments
  • Orientation and training in the method materials
  • The ordered item booklet and item map
  • 4.3.1 Evaluation questionnaires
  • 4.4 Data collection
  • 4.5 Methods of analyses
  • 4.5.1 Data analyses for procedural evaluation
  • 4.5.2 Data analyses for the internal evaluation of the standard setting workshops
  • 4.5.2.1 Inter-panelist and intra-panelist consistency within the CTT paradigm
  • 4.5.2.2 Inter-panelist and intra-panelist consistency within the RMT paradigm
  • RMT background
  • Inter-panelist and intra-panelist consistency
  • Synopsis of the Rasch inter- and intra-panelist indices
  • 4.5.2.3 Consistency within the method
  • 4.5.2.4 Decision accuracy and consistency
  • 4.5.3 Data analyses for external evaluation
  • 4.5.4 Analyzing focus group interviews
  • 4.5.5 Analyzing panelist discussion (Study B)
  • 4.6 Summary of methodology
  • CHAPTER 5 Procedural Validity
  • 5.1 Evaluating the Orientation and Training in the method stages
  • 5.2 Evaluating the standard setting of the Reading component
  • 5.3 Evaluating the benchmarking study of the Reading-into-writing component
  • 5.4 Influencing panelist judgments
  • 5.5 Conclusion: procedural validity
  • CHAPTER 6 Validating the Reading-Into-Writing Workshops
  • 6.1 Inter- and intra-panelist consistency for the Reading-into-writing component
  • 6.1.1 Inter-panelist consistency: CTT framework
  • 6.1.2 Inter-panelist consistency: RMT framework
  • 6.1.3 Intra-panelist consistency: CTT framework
  • 6.1.4 Intra-panelist consistency: RMT framework
  • 6.2 Consistency within the method for the Reading-into-writing component
  • 6.2.1 Comparing the internal and external panelist groups
  • 6.2.2 Evaluating the accuracy and precision of the Reading-into-writing cut score
  • 6.2.3 Decision accuracy and consistency for the Reading-into-writing cut score
  • 6.3 External validity
  • 6.3.1 Comparing panelist groups across modes: Reading-into-writing
  • 6.3.2 DGF across studies and modes: Reading-into-writing
  • 6.3.3 DPF across studies and modes: Reading-into-writing
  • 6.3.4 Consistency, impact, and reasonableness of the Reading-into-writing judgments
  • 6.4 Conclusion: validating the Reading-into-writing component
  • CHAPTER 7 Validating the Reading Workshops
  • 7.1 Inter- and intra-panelist consistency for the Reading component
  • 7.1.1 Inter-panelist consistency: CTT framework
  • 7.1.2 Inter-panelist consistency: RMT framework
  • 7.1.3 Intra-panelist consistency: CTT framework
  • 7.1.4 Intra-panelist consistency: RMT framework
  • 7.2 Consistency within the method for the Reading component
  • 7.2.1 Comparing the internal and external panelist groups: Reading Study A
  • 7.2.2 Locating the recommended cut score for the Reading component
  • 7.2.3 Evaluating the accuracy and precision of the Reading cut score
  • 7.2.4 Decision accuracy and consistency for the Reading cut score
  • 7.3 External validity
  • 7.3.1 Comparing panelists across modes: Reading Component
  • 7.3.2 DGF across studies and modes: Reading
  • 7.3.3 DPF across studies and modes: Reading
  • 7.3.4 Consistency of the Reading judgments
  • 7.3.5 Reasonableness of recommended cut scores
  • 7.4 Conclusion: validating the Reading component
  • CHAPTER 8 Calculating Cut Scores in a Single-Level Examination
  • 8.1 Framework for calculating threshold region(s) and cut score(s)
  • 8.2 Operationalizing the framework for calculating cut scores
  • Step 1: Establishing the predictive power of each item
  • Step 2: Converting ability measures or raw scores to z scores
  • Step 3: Establishing item clusters
  • Step 4: Exploring the predictive power of the calculated threshold regions
  • Step 5: Evaluating the calculated cut scores
  • CHAPTER 9 Findings from Focus Group Interviews
  • 9.1 Overview of the coding process and scheme
  • 9.2 Evaluating the ID Matching method: overall perceptions
  • 9.2.1 Evaluating the ID Matching method (receptive skills)
  • 9.2.2 Evaluating the ID Matching method (productive skills)
  • 9.3 Establishing the beginning of the level with the ID Matching method
  • 9.4 Factors affecting panelists’ judgments
  • 9.5 Using the CEFR descriptors instead of Performance Level Descriptors
  • 9.6 Evaluating the virtual synchronous environment
  • 9.7 Evaluating the panelist discussion in terms of CEFR referencing
  • 9.8 Conclusion: findings from focus group interviews
  • CHAPTER 10 Discussion
  • 10.1 The ID Matching method to standard-set and benchmark productive skills
  • 10.2 The ID Matching method to standard-set receptive skills
  • 10.3 The challenges of using the CEFR as PLDs
  • 10.4 Expanding the breadth of the standard setting stage
  • 10.5 Expanding the breadth of CEFR alignment studies
  • 10.6 The F2F and virtual environments
  • CHAPTER 11 Synopsis of Study
  • CHAPTER 12 Contribution
  • 12.1 Recommendations
  • Recommendations for the ID Matching method in receptive skills
  • Recommendations for the ID Matching method in productive skills
  • Recommendations for familiarization and standardization activities
  • Observations and recommendations for virtual standard setting workshops
  • 12.2 Implications
  • CEFR familiarization and training in the method activities
  • The OIB and the number of items included in it
  • Evaluating the reasonableness of cut scores
  • Panel composition in a CEFR ID Matching standard setting workshop
  • 12.3 Limitations
  • 12.4 Concluding remarks
  • Bibliography
  • Appendixes
  • Appendix A: Panelist characteristics
  • Appendix B: Focus group interviews protocol
  • Focus group interviews: introductory statement and questions
  • Introductory statement
  • Introductory question
  • Focus questions for ID Matching method: Reading
  • Transition
  • Key questions
  • Probe questions for both Reading and Writing
  • Focus group questions for ID Matching method: Writing
  • Transition
  • Key questions
  • Probe questions for both Reading and Writing
  • Focus group question for the environment of the standard setting study
  • Transition
  • Key questions
  • Ending questions
  • Appendix C: The Partial Credit Wright Item Map
  • Appendix D: Procedural evidence evaluation questionnaires
  • Appendix E: Panelist measurement report in Reading-into-writing (Study A)
  • Appendix F: Panelist measurement report in Reading-into-writing (Study B)
  • Appendix G: Panelist measurement report in Reading (Study A)
  • Appendix H: Panelist measurement report in Reading (Study B)
  • Appendix I: MPI Reading, Round 1 (Study A)
  • Appendix J: MPI Reading, Round 1 (Study B)
  • Appendix K: MPI Reading, Round 2 (Study B)
  • Appendix L: Codes and themes
  • Appendix M: Coder agreement
  • Appendix N: Conceptual mapping of panelist discussion
  • Appendix O: Example of a top-down familiarization activity
  • Name Index
  • Subject Index

Figures

Figure 2.1: Embedded Standard Setting iterative process in SIPS

Figure 2.2: Validity evidence of linkage of examinations/test results to the CEFR

Figure 2.3: Visual representation of procedures to relate examinations to the CEFR

Figure 2.4: Model for linking a test to the CEFR

Figure 2.5: Steps in the alignment process

Figure 3.1: Overview of Study A F2F standard setting & benchmarking workshop

Figure 3.2: ID Matching method procedures

Figure 4.1: Multi-phase mixed-methods design

Figure 4.2: Overview of the study

Figure 4.3: Virtual workshop snapshot

Figure 4.4: Structure of focus group interviews

Figure 4.5: Online OIB example page

Figure 4.6: Study B item map example

Figure 4.7: Coding methods summary (focus group data)

Figure 5.1: Orientation & Training evaluation, Study A (N = 12) & Study B (N = 9)

Figure 5.2: Reading phase evaluation Study A (n = 11) & Study B (n = 7)

Figure 5.3: Reading-into-writing phase evaluation, Study A (n = 10) & Study B (n = 6)

Figure 6.1: CEFR judgment agreement on common scripts, Study A (F2F, N = 11)

Figure 6.2: CEFR judgment agreement on common scripts, Study B (virtual, N = 9)

Figure 7.1: CEFR judgment agreement on common Reading items, Study A (F2F, N = 12)

Figure 7.2: CEFR judgment agreement on common Reading items, Study B (Virtual, N = 9)

Figure 7.3: Use of CEFR scales in F2F and virtual workshops (Reading)

Figure 8.1: Framework for calculating threshold region(s) and cut score(s)

Figure 9.1: An overview of the themes and codes

Figure 9.2: An overview of the hierarchy of the themes and codes

Figure 9.3: Cluster analysis on word similarity

Figure 9.4: Word tree rationalizing B2 judgments in the virtual environment

Figure 9.5: Word tree around “text” discussion in the virtual environment

Figure 9.6: Conceptual mapping of the discussion to the CEFR scales

Figure 9.7: Conceptual mapping of the discussion to the CEFR-level scales

Figure 10.1: Model for a CEFR linking study with an item-mapping method

Figure 10.2: Structure of the unified alignment and test development (UATD) process

Figure 12.1: Monitoring panelist engagement

Tables

Table 2.1: Hypothetical illustration of a threshold region in the ID Matching method

Table 4.1: Standard setting agenda for the virtual workshop, Study B

Table 4.2: Expanded Cizek & Earnest (2016) evaluation framework

Table 4.3: Materials and instruments used in Studies A (F2F) & B (Virtual)

Table 4.4: Examples of evaluation questionnaire modifications

Table 4.5: Summary of data collected in Study A

Table 4.6: Summary of data collected in Study B

Table 4.7: Summary of data collected in Study C

Table 4.8: Data collected—Reading

Table 4.9: Data collected—Reading-into-writing task

Table 4.10: Panelist judgments of scripts—Reading-into-writing

Table 4.11: Coding CEFR-level judgments to numeric values

Table 4.12: Data collected & analyses overview

Table 5.1: Influence on Reading standard setting judgments

Table 5.2: Influence on Reading-into-writing standard setting judgments

Table 6.1: Descriptor frequency in reading-into-writing task

Table 6.2: Inter-panelist agreement & consistency: Reading-into-writing

Table 6.3: Judgment variance (Study A—Writing, N = 11)

Table 6.4: Judgment variance (Study B—Writing, N = 9)

Table 6.5: Inter-panelist agreement & consistency indices (Reading-into-writing)

Table 6.6: Panelist unexpected responses

Table 6.7: Intra-panelist agreement & consistency (Reading-into-writing)

Table 6.8: Summary of fit statistics for Reading-into-writing

Table 6.9: Mean severity: Externals vs internals (Reading-into-writing, R1)

Table 6.10: Pairwise interactions on scripts (internals vs externals, Study A, F2F, N = 11)

Table 6.11: Pairwise interactions on scripts (internals vs externals, Study B, Virtual, N = 9)

Table 6.12: CEFR judgments for Script 5 in both environments

Table 6.13: Evaluating the Reading-into-writing recommended cut scores (N = 1,111)

Table 6.14: Mean severity (F2F vs Virtual), Reading-into-writing

Table 6.15: Pairwise interaction between mode and written scripts, Study A & B

Table 6.16: Pairwise interaction between environment and panelist

Table 6.17: Final average CEFR-level judgments in the F2F and virtual workshop

Table 7.1: Inter-panelist agreement & consistency on holistic CEFR item judgments (Reading)

Table 7.2: Inter-panelist agreement & consistency on analytic CEFR Reading item judgments

Table 7.3: Summary of inter-panelist agreement & consistency indices (Reading)

Table 7.4: Intra-panelist consistency between empirical data and Reading judgments, MPI

Table 7.5: Intra-panelist agreement (holistic & analytic), Study A (F2F), Reading, (N = 12)

Table 7.6: Intra-panelist agreement (holistic & analytic), Study B (virtual), Reading, (N = 9)

Table 7.7: Intra-panelist reliability (holistic) Reading, Study B (virtual, N = 9)

Table 7.8: Summary of fit statistics

Table 7.9: Comparing the severity of the two panelist subgroups in the F2F & virtual workshops

Table 7.10: Pairwise interactions (subgroups & tasks, F2F vs Virtual)

Table 7.11: Cut score locations, Study A (F2F, N = 12)

Table 7.12: Cut score locations, Study B (N = 9)

Table 7.13: Evaluating the error in Reading cut scores, Study A, (N = 12)

Table 7.14: Evaluating the error in Reading cut scores, Study B (N = 9)

Table 7.15: Evaluation of recommended Reading cut scores (N = 1,109)

Table 7.16: Panel severity on common Reading items (F2F vs virtual panels)

Table 7.17: DGF analysis on common Reading items (F2F vs virtual panels)

Table 7.18: Panelist pairwise interactions on common Reading items (F2F vs virtual)

Table 7.19: CEFR judgments on common Reading items

Table 7.20: Test-taker classification (N = 1,109)

Table 8.1: Coefficients from linear regression analysis (n = 1,103)

Table 8.2: Distance of cut scores from population mean (logit & RS)

Table 8.3: Item clusters via Wald statistics

Table 8.4: Predictive power of item clusters (n = 1,103)

Table 8.5: Calculated cut score locations (N = 1,109)

Table 8.6: DA & DC of calculated cut scores (N = 1,109)

Table 8.7: Test-taker classification on calculated cut scores (N = 1,109)

Table 9.1: Panelist affiliation & experience, (Study C, Virtual, N = 9)

Table 9.2: Intercoder agreement

Table 9.3: Coding scheme (theme 1/RQ6)

Table 9.4: Coding scheme (theme 2/RQ6.1)

Table 9.5: Coding scheme (theme 3/RQ6.1)

Table 9.6: Coding scheme (theme 4/RQ6.1)

Table 9.7: Coding scheme (theme 5/RQ6.1)

Table 9.8: Coding scheme (theme 6/RQ6.2)

Table 9.9: Coding scheme (theme 7/RQ6.3)

Table 9.10: Relationship of sources in cluster analysis

Abbreviations

ACJ Adaptive comparative judgment

ALDs Achievement Level Descriptors

ALTE Association for Language Testers in Europe

AO Awarding Organization

BoW Body of Work

BSSP Bookmark Standard Setting Procedure

CEFR Common European Framework of Reference for Languages

CJ Comparative judgment

CI (LL, UL) Confidence interval (lower limit, upper limit)

CLT Central Limit Theorem

CREL Conditional reliability

CS Cut score

CSEM Conditional standard error of measurement

CTT Classical Test Theory

CW Creative Writing

DA Decision accuracy

DGF Differential group functioning

DIF Differential item functioning

DPF Differential panelist functioning

DPJ Dominant Profile Judgment

EALTA European Association for Language Testing and Assessment

ESS Embedded Standard Setting

ESSA Every Student Succeeds Act

ETS Educational Testing Service

FGs Focus groups

GEPT General English Proficiency Test

ICC Intraclass Correlation Coefficient

IELTS International English Language Testing System

IRT Item Response Theory

ISE Integrated Skills in English

JPC Judgment policy capturing

KSA(s) Knowledge, skills, and abilities

KWIC Key word in context

LL Livingston and Lewis

LTA Language testing and assessment

MAPT Massachusetts Adult Proficiency Tests

MCC Minimally competent candidate

MFRM Many-Facet Rasch Measurement

MH Mantel-Haenszel

MPI Misplacement Index

MSPAP Maryland School Performance Assessment Program

NAEP National Assessment of Educational Progress

NAGB National Assessment Governing Board

NCLB No Child Left Behind

NDA Non-disclosure agreement

OIB Ordered item booklet

OPB Ordered profile booklet

ORC Overall Reading Comprehension

OSS Objective Standard Setting

OWP Overall Written Production

PADDI Principled assessment design, development, and implementation

Details

Pages: 432
Publication Year: 2026
ISBN (PDF): 9783631921678
ISBN (ePUB): 9783631921685
ISBN (Hardcover): 9783631921661
DOI: 10.3726/b23366
Open Access: CC-BY
Language: English
Publication date: 2026 (April)
Keywords: CEFR; standard setting; cut scores; evaluation framework; Unified Alignment & Test Development (UATD); principled cut score approach; virtual workshops; face-to-face workshops; comparative study; standard setting theory; language testing
Published: Berlin, Bruxelles, Chennai, Lausanne, New York, Oxford, 2026. 432 pp., 29 fig. col., 15 fig. b/w, 84 tables.

Biographical notes

Paraskevi (Voula) Kanistra (Author)

Paraskevi (Voula) Kanistra is Director of English Language Assessment at Trinity College London. She has nearly thirty years of experience in language assessment, including work as an examiner trainer, test developer, and assessment lead, with a strong focus on test design, validation, and quality assurance across international contexts. Her professional expertise centres on CEFR alignment, standard setting, validation, and assessment innovation. She has served as Treasurer of EALTA and has acted as a reviewer for journals and conferences.
