Evaluating the Item Descriptor (ID) Matching Method in a Face-to-Face and Synchronous Virtual Environment

by Paraskevi (Voula) Kanistra (Author)
©2026 · Thesis · 432 pages
Open Access
Series: Language Testing and Evaluation, Volume 50

Summary

This book turns the page on standard setting, calling for a time of change. Expanding Cizek & Earnest's (2016) evaluation framework, it delivers a comprehensive mixed-methods investigation of Ferrara & Lewis's (2012) Item Descriptor Matching method, using a multi-phase design across face-to-face and virtual workshops. At its core lies the author's Unified Alignment & Test Development (UATD) framework, which embeds a unique quantitative principled cut score approach to calculate defensible, trustworthy thresholds grounded in theory. Using corpus linguistics and AI models to analyse panel discussions, the book shows how innovative methodologies enhance the validity and robustness of CEFR-linking studies. It also translates the bottom-up and top-down strategies panelists used to tame the CEFR into innovative activities for the familiarisation stage. The result is an expanded, transparent, and forward-looking practice that strengthens the validity, fairness, and impact of standard setting.

Table of Contents

  • Cover
  • Title Page
  • Copyright Page
  • Dedication
  • Table of Contents
  • Figures
  • Tables
  • Abbreviations
  • Preface
  • Acknowledgments
  • Abstract
  • CHAPTER 1 Introduction
  • 1.1 Structure of the book
  • CHAPTER 2 Literature Review
  • 2.1 An overview of standard setting
  • 2.1.1 The Angoff method
  • Advantages and disadvantages of the Angoff method
  • 2.1.2 The Bookmark Standard Setting Procedure
  • The ordered item booklet
  • The panelist task and cut score calculation
  • Advantages and disadvantages of the Bookmark Standard Setting Procedure
  • 2.1.3 The Item Descriptor Matching method
  • Background
  • Description of the ID standard setting task
  • Description of the standard setting process
  • Advantages and disadvantages of the Item Descriptor Matching method
  • 2.1.4 The Embedded Standard Setting method
  • Examinee-centered methods
  • 2.1.5 The Performance Profile method
  • Advantages and disadvantages of the Performance Profile method
  • 2.1.6 The Dominant Profile method
  • Advantages and disadvantages of the Dominant Profile Method
  • 2.1.7 The Body of Work method
  • Advantages and disadvantages of the Body of Work Method
  • 2.2 Aligning examinations to the CEFR
  • 2.3 Issues with the CEFR and with aligning examinations to the CEFR
  • 2.4 Virtual standard setting
  • 2.5 Contextualizing the research study
  • CHAPTER 3 Background to the Study
  • 3.1 Description of the ISE examination
  • 3.1.1 Selection of tasks
  • 3.2 Methodology and process of the Trinity benchmarking study
  • 3.2.1 The standard setting process
  • 3.2.2 The standard setting method
  • 3.2.3 Participants
  • 3.3 Summary of background to the study
  • CHAPTER 4 Methodology
  • 4.1 Aim, design, and research questions
  • Research questions
  • 4.1.1 Study design
  • 4.1.2 Overview of the study
  • 4.1.3 Conducting the virtual workshop
  • 4.1.4 Background to focus group interviews
  • 4.1.5 Focus group interviews
  • 4.2 Framework for evaluating standard setting workshops
  • 4.3 Materials and data collection instruments
  • Orientation and training in the method materials
  • The ordered item booklet and item map
  • 4.3.1 Evaluation questionnaires
  • 4.4 Data collection
  • 4.5 Methods of analyses
  • 4.5.1 Data analyses for procedural evaluation
  • 4.5.2 Data analyses for the internal evaluation of the standard setting workshops
  • 4.5.2.1 Inter-panelist and intra-panelist consistency within the CTT paradigm
  • 4.5.2.2 Inter-panelist and intra-panelist consistency within the RMT paradigm
  • RMT background
  • Inter-panelist and intra-panelist consistency
  • Synopsis of the Rasch inter- and intra-panelist indices
  • 4.5.2.3 Consistency within the method
  • 4.5.2.4 Decision accuracy and consistency
  • 4.5.3 Data analyses for external evaluation
  • 4.5.4 Analyzing focus group interviews
  • 4.5.5 Analyzing panelist discussion (Study B)
  • 4.6 Summary of methodology
  • CHAPTER 5 Procedural Validity
  • 5.1 Evaluating the Orientation and Training in the method stages
  • 5.2 Evaluating the standard setting of the Reading component
  • 5.3 Evaluating the benchmarking study of the Reading-into-writing component
  • 5.4 Influencing panelist judgments
  • 5.5 Conclusion: procedural validity
  • CHAPTER 6 Validating the Reading-Into-Writing Workshops
  • 6.1 Inter- and intra-panelist consistency for the Reading-into-writing component
  • 6.1.1 Inter-panelist consistency: CTT framework
  • 6.1.2 Inter-panelist consistency: RMT framework
  • 6.1.3 Intra-panelist consistency: CTT framework
  • 6.1.4 Intra-panelist consistency: RMT framework
  • 6.2 Consistency within the method for the Reading-into-writing component
  • 6.2.1 Comparing the internal and external panelist groups
  • 6.2.2 Evaluating the accuracy and precision of the Reading-into-writing cut score
  • 6.2.3 Decision accuracy and consistency for the Reading-into-writing cut score
  • 6.3 External validity
  • 6.3.1 Comparing panelist groups across modes: Reading-into-writing
  • 6.3.2 DGF across studies and modes: Reading-into-writing
  • 6.3.3 DPF across studies and modes: Reading-into-writing
  • 6.3.4 Consistency, impact, and reasonableness of the Reading-into-writing judgments
  • 6.4 Conclusion: validating the Reading-into-writing component
  • CHAPTER 7 Validating the Reading Workshops
  • 7.1 Inter- and intra-panelist consistency for the Reading component
  • 7.1.1 Inter-panelist consistency: CTT framework
  • 7.1.2 Inter-panelist consistency: RMT framework
  • 7.1.3 Intra-panelist consistency: CTT framework
  • 7.1.4 Intra-panelist consistency: RMT framework
  • 7.2 Consistency within the method for the Reading component
  • 7.2.1 Comparing the internal and external panelist groups: Reading Study A
  • 7.2.2 Locating the recommended cut score for the Reading component
  • 7.2.3 Evaluating the accuracy and precision of the Reading cut score
  • 7.2.4 Decision accuracy and consistency for the Reading cut score
  • 7.3 External validity
  • 7.3.1 Comparing panelists across modes: Reading Component
  • 7.3.2 DGF across studies and modes: Reading
  • 7.3.3 DPF across studies and modes: Reading
  • 7.3.4 Consistency of the Reading judgments
  • 7.3.5 Reasonableness of recommended cut scores
  • 7.4 Conclusion: validating the Reading component
  • CHAPTER 8 Calculating Cut Scores in a Single-Level Examination
  • 8.1 Framework for calculating threshold region(s) and cut score(s)
  • 8.2 Operationalizing the framework for calculating cut scores
  • Step 1: Establishing the predictive power of each item
  • Step 2: Converting ability measures or raw scores to z scores
  • Step 3: Establishing item clusters
  • Step 4: Exploring the predictive power of the calculated threshold regions
  • Step 5: Evaluating the calculated cut scores
  • CHAPTER 9 Findings from Focus Group Interviews
  • 9.1 Overview of the coding process and scheme
  • 9.2 Evaluating the ID Matching method: overall perceptions
  • 9.2.1 Evaluating the ID Matching method (receptive skills)
  • 9.2.2 Evaluating the ID Matching method (productive skills)
  • 9.3 Establishing the beginning of the level with the ID Matching method
  • 9.4 Factors affecting panelists’ judgments
  • 9.5 Using the CEFR descriptors instead of Performance Level Descriptors
  • 9.6 Evaluating the virtual synchronous environment
  • 9.7 Evaluating the panelist discussion in terms of CEFR referencing
  • 9.8 Conclusion: findings from focus group interviews
  • CHAPTER 10 Discussion
  • 10.1 The ID Matching method to standard-set and benchmark productive skills
  • 10.2 The ID Matching method to standard-set receptive skills
  • 10.3 The challenges of using the CEFR as PLDs
  • 10.4 Expanding the breadth of the standard setting stage
  • 10.5 Expanding the breadth of CEFR alignment studies
  • 10.6 The F2F and virtual environments
  • CHAPTER 11 Synopsis of Study
  • CHAPTER 12 Contribution
  • 12.1 Recommendations
  • Recommendations for the ID Matching method in receptive skills
  • Recommendations for the ID Matching method in productive skills
  • Recommendations for familiarization and standardization activities
  • Observations and recommendations for virtual standard setting workshops
  • 12.2 Implications
  • CEFR familiarization and training in the method activities
  • The OIB and the number of items included in it
  • Evaluating the reasonableness of cut scores
  • Panel composition in a CEFR ID Matching standard setting workshop
  • 12.3 Limitations
  • 12.4 Concluding remarks
  • Bibliography
  • Appendixes
  • Appendix A: Panelist characteristics
  • Appendix B: Focus group interviews protocol
  • Focus group interviews: introductory statement and questions
  • Introductory statement
  • Introductory question
  • Focus questions for ID Matching method: Reading
  • Transition
  • Key questions
  • Probe questions for both Reading and Writing
  • Focus group questions for ID Matching method: Writing
  • Transition
  • Key questions
  • Probe questions for both Reading and Writing
  • Focus group question for the environment of the standard setting study
  • Transition
  • Key questions
  • Ending questions
  • Appendix C: The Partial Credit Wright Item Map
  • Appendix D: Procedural evidence evaluation questionnaires
  • Appendix E: Panelist measurement report in Reading-into-writing (Study A)
  • Appendix F: Panelist measurement report in Reading-into-writing (Study B)
  • Appendix G: Panelist measurement report in Reading (Study A)
  • Appendix H: Panelist measurement report in Reading (Study B)
  • Appendix I: MPI Reading, Round 1 (Study A)
  • Appendix J: MPI Reading, Round 1 (Study B)
  • Appendix K: MPI Reading, Round 2 (Study B)
  • Appendix L: Codes and themes
  • Appendix M: Coder agreement
  • Appendix N: Conceptual mapping of panelist discussion
  • Appendix O: Example of a top-down familiarization activity
  • Name Index
  • Subject Index

Figures

Figure 2.1: Embedded Standard Setting iterative process in SIPS

Figure 2.2: Validity evidence of linkage of examinations/test results to the CEFR

Figure 2.3: Visual representation of procedures to relate examinations to the CEFR

Figure 2.4: Model for linking a test to the CEFR

Figure 2.5: Steps in the alignment process

Figure 3.1: Overview of Study A F2F standard setting & benchmarking workshop

Figure 3.2: ID Matching method procedures

Figure 4.1: Multi-phase mixed-methods design

Figure 4.2: Overview of the study

Figure 4.3: Virtual workshop snapshot

Figure 4.4: Structure of focus group interviews

Figure 4.5: Online OIB example page

Figure 4.6: Study B item map example

Figure 4.7: Coding methods summary (focus group data)

Figure 5.1: Orientation & Training evaluation, Study A (N = 12) & Study B (N = 9)

Figure 5.2: Reading phase evaluation Study A (n = 11) & Study B (n = 7)

Figure 5.3: Reading-into-writing phase evaluation, Study A (n = 10) & Study B (n = 6)

Figure 6.1: CEFR judgment agreement on common scripts, Study A (F2F, N = 11)

Figure 6.2: CEFR judgment agreement on common scripts, Study B (virtual, N = 9)

Figure 7.1: CEFR judgment agreement on common Reading items, Study A (F2F, N = 12)

Figure 7.2: CEFR judgment agreement on common Reading items, Study B (Virtual, N = 9)

Figure 7.3: Use of CEFR scales in F2F and virtual workshops (Reading)

Figure 8.1: Framework for calculating threshold region(s) and cut score(s)

Figure 9.1: An overview of the themes and codes

Figure 9.2: An overview of the hierarchy of the themes and codes

Figure 9.3: Cluster analysis on word similarity

Figure 9.4: Word tree rationalizing B2 judgments in the virtual environment

Figure 9.5: Word tree around “text” discussion in the virtual environment

Figure 9.6: Conceptual mapping of the discussion to the CEFR scales

Figure 9.7: Conceptual mapping of the discussion to the CEFR-level scales

Figure 10.1: Model for a CEFR linking study with an item-mapping method

Figure 10.2: Structure of the unified alignment and test development (UATD) process

Figure 12.1: Monitoring panelist engagement

Tables

Table 2.1: Hypothetical illustration of a threshold region in the ID Matching method

Table 4.1: Standard setting agenda for the virtual workshop, Study B

Table 4.2: Expanded Cizek & Earnest (2016) evaluation framework

Table 4.3: Materials and instruments used in Studies A (F2F) & B (Virtual)

Table 4.4: Examples of evaluation questionnaire modifications

Table 4.5: Summary of data collected in Study A

Table 4.6: Summary of data collected in Study B

Table 4.7: Summary of data collected in Study C

Table 4.8: Data collected—Reading

Table 4.9: Data collected—Reading-into-writing task

Table 4.10: Panelist judgments of scripts—Reading-into-writing

Table 4.11: Coding CEFR-level judgments to numeric values

Table 4.12: Data collected & analyses overview

Table 5.1: Influence on Reading standard setting judgments

Table 5.2: Influence on Reading-into-writing standard setting judgments

Table 6.1: Descriptor frequency in reading-into-writing task

Table 6.2: Inter-panelist agreement & consistency: Reading-into-writing

Table 6.3: Judgment variance (Study A—Writing, N = 11)

Table 6.4: Judgment variance (Study B—Writing, N = 9)

Table 6.5: Inter-panelist agreement & consistency indices (Reading-into-writing)

Table 6.6: Panelist unexpected responses

Table 6.7: Intra-panelist agreement & consistency (Reading-into-writing)

Table 6.8: Summary of fit statistics for Reading-into-writing

Table 6.9: Mean severity: Externals vs internals (Reading-into-writing, R1)

Table 6.10: Pairwise interactions on scripts (internals vs externals, Study A, F2F, N = 11)

Table 6.11: Pairwise interactions on scripts (internals vs externals, Study B, Virtual, N = 9)

Table 6.12: CEFR judgments for Script 5 in both environments

Table 6.13: Evaluating the Reading-into-writing recommended cut scores (N = 1,111)

Table 6.14: Mean severity (F2F vs Virtual), Reading-into-writing

Table 6.15: Pairwise interaction between mode and written scripts, Study A & B

Table 6.16: Pairwise interaction between environment and panelist

Table 6.17: Final average CEFR-level judgments in the F2F and virtual workshop

Table 7.1: Inter-panelist agreement & consistency on holistic CEFR item judgments (Reading)

Table 7.2: Inter-panelist agreement & consistency on analytic CEFR Reading item judgments

Table 7.3: Summary of inter-panelist agreement & consistency indices (Reading)

Table 7.4: Intra-panelist consistency between empirical data and Reading judgments, MPI

Table 7.5: Intra-panelist agreement (holistic & analytic), Study A (F2F), Reading, (N = 12)

Table 7.6: Intra-panelist agreement (holistic & analytic), Study B (virtual), Reading, (N = 9)

Table 7.7: Intra-panelist reliability (holistic) Reading, Study B (virtual, N = 9)

Table 7.8: Summary of fit statistics

Table 7.9: Comparing the severity of the two panelist subgroups in the F2F & virtual workshops

Table 7.10: Pairwise interactions (subgroups & tasks, F2F vs Virtual)

Table 7.11: Cut score locations, Study A (F2F, N = 12)

Table 7.12: Cut score locations, Study B (N = 9)

Table 7.13: Evaluating the error in Reading cut scores, Study A, (N = 12)

Table 7.14: Evaluating the error in Reading cut scores, Study B (N = 9)

Table 7.15: Evaluation of recommended Reading cut scores (N = 1,109)

Table 7.16: Panel severity on common Reading items (F2F vs virtual panels)

Table 7.17: DGF analysis on common Reading items (F2F vs virtual panels)

Table 7.18: Panelist pairwise interactions on common Reading items (F2F vs virtual)

Table 7.19: CEFR judgments on common Reading items

Table 7.20: Test-taker classification (N = 1,109)

Table 8.1: Coefficients from linear regression analysis (n = 1,103)

Table 8.2: Distance of cut scores from population mean (logit & RS)

Table 8.3: Item clusters via Wald statistics

Table 8.4: Predictive power of item clusters (n = 1,103)

Table 8.5: Calculated cut score locations (N = 1,109)

Table 8.6: DA & DC of calculated cut scores (N = 1,109)

Table 8.7: Test-taker classification on calculated cut scores (N = 1,109)

Table 9.1: Panelist affiliation & experience, (Study C, Virtual, N = 9)

Table 9.2: Intercoder agreement

Table 9.3: Coding scheme (theme 1/RQ6)

Table 9.4: Coding scheme (theme 2/RQ6.1)

Table 9.5: Coding scheme (theme 3/RQ6.1)

Table 9.6: Coding scheme (theme 4/RQ6.1)

Table 9.7: Coding scheme (theme 5/RQ6.1)

Table 9.8: Coding scheme (theme 6/RQ6.2)

Table 9.9: Coding scheme (theme 7/RQ6.3)

Table 9.10: Relationship of sources in cluster analysis

Abbreviations

ACJ Adaptive comparative judgment

ALDs Achievement Level Descriptors

ALTE Association for Language Testers in Europe

AO Awarding Organization

BoW Body of Work

BSSP Bookmark Standard Setting Procedure

CEFR Common European Framework of Reference for Languages

CJ Comparative judgment

CI (LL, UL) Confidence interval (lower limit, upper limit)

CLT Central Limit Theorem

CREL Conditional reliability

CS Cut score

CSEM Conditional standard error of measurement

CTT Classical Test Theory

CW Creative Writing

DA Decision accuracy

DGF Differential group functioning

DIF Differential item functioning

DPF Differential panelist functioning

DPJ Dominant Profile Judgment

EALTA European Association for Language Testing and Assessment

ESS Embedded Standard Setting

ESSA Every Student Succeeds Act

ETS Educational Testing Service

FGs Focus groups

GEPT General English Proficiency Test

ICC Intraclass Correlation Coefficient

IELTS International English Language Testing System

IRT Item Response Theory

ISE Integrated Skills in English

JPC Judgment policy capturing

KSA(s) Knowledge, skills, and abilities

KWIC Key word in context

LL Livingston and Lewis

LTA Language testing and assessment

MAPT Massachusetts Adult Proficiency Tests

MCC Minimally competent candidate

MFRM Many-Facet Rasch Measurement

MH Mantel-Haenszel

MPI Misplacement Index

MSPAP Maryland School Performance Assessment Program

NAEP National Assessment of Educational Progress

NAGB National Assessment Governing Board

NCLB No Child Left Behind

NDA Non-disclosure agreement

OIB Ordered item booklet

OPB Ordered profile booklet

ORC Overall Reading Comprehension

OSS Objective Standard Setting

OWP Overall Written Production

PADDI Principled assessment design, development, and implementation

Details

Pages: 432
Publication Year: 2026
ISBN (PDF): 9783631921678
ISBN (ePUB): 9783631921685
ISBN (Hardcover): 9783631921661
DOI: 10.3726/b23366
Open Access: CC-BY
Language: English
Publication date: 2026 (April)
Keywords: CEFR; standard setting; cut scores; evaluation framework; Unified Alignment & Test Development (UATD); principled cut score approach; virtual workshops; face-to-face workshops; comparative study; standard setting theory; language testing
Published: Berlin, Bruxelles, Chennai, Lausanne, New York, Oxford, 2026. 432 pp., 29 fig. col., 15 fig. b/w, 84 tables.

Biographical notes

Paraskevi (Voula) Kanistra (Author)

Paraskevi (Voula) Kanistra is Director of English Language Assessment at Trinity College London. She has nearly thirty years of experience in language assessment, including work as an examiner trainer, test developer, and assessment lead, with a strong focus on test design, validation, and quality assurance across international contexts. Her professional expertise centres on CEFR alignment, standard setting, validation, and assessment innovation. She has served as Treasurer of EALTA and has acted as a reviewer for journals and conferences.
