Evaluating the Item Descriptor (ID) Matching Method in a Face-to-Face and Synchronous Virtual Environment
Table of Contents
- Cover
- Title Page
- Copyright Page
- Dedication
- Table of Contents
- Figures
- Tables
- Abbreviations
- Preface
- Acknowledgments
- Abstract
- CHAPTER 1 Introduction
- 1.1 Structure of the book
- CHAPTER 2 Literature Review
- 2.1 An overview of standard setting
- 2.1.1 The Angoff method
- Advantages and disadvantages of the Angoff method
- 2.1.2 The Bookmark Standard Setting Procedure
- The ordered item booklet
- The panelist task and cut score calculation
- Advantages and disadvantages of the Bookmark Standard Setting Procedure
- 2.1.3 The Item Descriptor Matching method
- Background
- Description of the ID standard setting task
- Description of the standard setting process
- Advantages and disadvantages of the Item Descriptor Matching method
- 2.1.4 The Embedded Standard Setting method
- Examinee-centered methods
- 2.1.5 The Performance Profile method
- Advantages and disadvantages of the Performance Profile method
- 2.1.6 The Dominant Profile method
- Advantages and disadvantages of the Dominant Profile Method
- 2.1.7 The Body of Work method
- Advantages and disadvantages of the Body of Work Method
- 2.2 Aligning examinations to the CEFR
- 2.3 Issues with the CEFR and with aligning examinations to the CEFR
- 2.4 Virtual standard setting
- 2.5 Contextualizing the research study
- CHAPTER 3 Background to the Study
- 3.1 Description of the ISE examination
- 3.1.1 Selection of tasks
- 3.2 Methodology and process of the Trinity benchmarking study
- 3.2.1 The standard setting process
- 3.2.2 The standard setting method
- 3.2.3 Participants
- 3.3 Summary of background to the study
- CHAPTER 4 Methodology
- 4.1 Aim, design, and research questions
- Research questions
- 4.1.1 Study design
- 4.1.2 Overview of the study
- 4.1.3 Conducting the virtual workshop
- 4.1.4 Background to focus group interviews
- 4.1.5 Focus group interviews
- 4.2 Framework for evaluating standard setting workshops
- 4.3 Materials and data collection instruments
- Orientation and training in the method materials
- The ordered item booklet and item map
- 4.3.1 Evaluation questionnaires
- 4.4 Data collection
- 4.5 Methods of analyses
- 4.5.1 Data analyses for procedural evaluation
- 4.5.2 Data analyses for the internal evaluation of the standard setting workshops
- 4.5.2.1 Inter-panelist and intra-panelist consistency within the CTT paradigm
- 4.5.2.2 Inter-panelist and intra-panelist consistency within the RMT paradigm
- RMT background
- Inter-panelist and intra-panelist consistency
- Synopsis of the Rasch inter- and intra-panelist indices
- 4.5.2.3 Consistency within the method
- 4.5.2.4 Decision accuracy and consistency
- 4.5.3 Data analyses for external evaluation
- 4.5.4 Analyzing focus group interviews
- 4.5.5 Analyzing panelist discussion (Study B)
- 4.6 Summary of methodology
- CHAPTER 5 Procedural Validity
- 5.1 Evaluating the Orientation and Training in the method stages
- 5.2 Evaluating the standard setting of the Reading component
- 5.3 Evaluating the benchmarking study of the Reading-into-writing component
- 5.4 Influencing panelist judgments
- 5.5 Conclusion: procedural validity
- CHAPTER 6 Validating the Reading-Into-Writing Workshops
- 6.1 Inter- and intra-panelist consistency for the Reading-into-writing component
- 6.1.1 Inter-panelist consistency: CTT framework
- 6.1.2 Inter-panelist consistency: RMT framework
- 6.1.3 Intra-panelist consistency: CTT framework
- 6.1.4 Intra-panelist consistency: RMT framework
- 6.2 Consistency within the method for the Reading-into-writing component
- 6.2.1 Comparing the internal and external panelist groups
- 6.2.2 Evaluating the accuracy and precision of the Reading-into-writing cut score
- 6.2.3 Decision accuracy and consistency for the Reading-into-writing cut score
- 6.3 External validity
- 6.3.1 Comparing panelist groups across modes: Reading-into-writing
- 6.3.2 DGF across studies and modes: Reading-into-writing
- 6.3.3 DPF across studies and modes: Reading-into-writing
- 6.3.4 Consistency, impact, and reasonableness of the Reading-into-writing judgments
- 6.4 Conclusion: validating the Reading-into-writing component
- CHAPTER 7 Validating the Reading Workshops
- 7.1 Inter- and intra-panelist consistency for the Reading component
- 7.1.1 Inter-panelist consistency: CTT framework
- 7.1.2 Inter-panelist consistency: RMT framework
- 7.1.3 Intra-panelist consistency: CTT framework
- 7.1.4 Intra-panelist consistency: RMT framework
- 7.2 Consistency within the method for the Reading component
- 7.2.1 Comparing the internal and external panelist groups: Reading Study A
- 7.2.2 Locating the recommended cut score for the Reading component
- 7.2.3 Evaluating the accuracy and precision of the Reading cut score
- 7.2.4 Decision accuracy and consistency for the Reading cut score
- 7.3 External validity
- 7.3.1 Comparing panelists across modes: Reading Component
- 7.3.2 DGF across studies and modes: Reading
- 7.3.3 DPF across studies and modes: Reading
- 7.3.4 Consistency of the Reading judgments
- 7.3.5 Reasonableness of recommended cut scores
- 7.4 Conclusion: validating the Reading component
- CHAPTER 8 Calculating Cut Scores in a Single-Level Examination
- 8.1 Framework for calculating threshold region(s) and cut score(s)
- 8.2 Operationalizing the framework for calculating cut scores
- Step 1: Establishing the predictive power of each item
- Step 2: Converting ability measures or raw scores to z scores
- Step 3: Establishing item clusters
- Step 4: Exploring the predictive power of the calculated threshold regions
- Step 5: Evaluating the calculated cut scores
- CHAPTER 9 Findings from Focus Group Interviews
- 9.1 Overview of the coding process and scheme
- 9.2 Evaluating the ID Matching method: overall perceptions
- 9.2.1 Evaluating the ID Matching method (receptive skills)
- 9.2.2 Evaluating the ID Matching method (productive skills)
- 9.3 Establishing the beginning of the level with the ID Matching method
- 9.4 Factors affecting panelists’ judgments
- 9.5 Using the CEFR descriptors instead of Performance Level Descriptors
- 9.6 Evaluating the virtual synchronous environment
- 9.7 Evaluating the panelist discussion in terms of CEFR referencing
- 9.8 Conclusion: findings from focus group interviews
- CHAPTER 10 Discussion
- 10.1 The ID Matching method to standard-set and benchmark productive skills
- 10.2 The ID Matching method to standard-set receptive skills
- 10.3 The challenges of using the CEFR as PLDs
- 10.4 Expanding the breadth of the standard setting stage
- 10.5 Expanding the breadth of CEFR alignment studies
- 10.6 The F2F and virtual environments
- CHAPTER 11 Synopsis of Study
- CHAPTER 12 Contribution
- 12.1 Recommendations
- Recommendations for the ID Matching method in receptive skills
- Recommendations for the ID Matching method in productive skills
- Recommendations for familiarization and standardization activities
- Observations and recommendations for virtual standard setting workshops
- 12.2 Implications
- CEFR familiarization and training in the method activities
- The OIB and the number of items included in it
- Evaluating the reasonableness of cut scores
- Panel composition in a CEFR ID Matching standard setting workshop
- 12.3 Limitations
- 12.4 Concluding remarks
- Bibliography
- Appendixes
- Appendix A: Panelist characteristics
- Appendix B: Focus group interviews protocol
- Focus group interviews: introductory statement and questions
- Introductory statement
- Introductory question
- Focus questions for ID Matching method: Reading
- Transition
- Key questions
- Probe questions for both Reading and Writing
- Focus group questions for ID Matching method: Writing
- Transition
- Key questions
- Probe questions for both Reading and Writing
- Focus group question for the environment of the standard setting study
- Transition
- Key questions
- Ending questions
- Appendix C: The Partial Credit Wright Item Map
- Appendix D: Procedural evidence evaluation questionnaires
- Appendix E: Panelist measurement report in Reading-into-writing (Study A)
- Appendix F: Panelist measurement report in Reading-into-writing (Study B)
- Appendix G: Panelist measurement report in Reading (Study A)
- Appendix H: Panelist measurement report in Reading (Study B)
- Appendix I: MPI Reading, Round 1 (Study A)
- Appendix J: MPI Reading, Round 1 (Study B)
- Appendix K: MPI Reading, Round 2 (Study B)
- Appendix L: Codes and themes
- Appendix M: Coder agreement
- Appendix N: Conceptual mapping of panelist discussion
- Appendix O: Example of a top-down familiarization activity
- Name Index
- Subject Index
Figures
Figure 2.1: Embedded Standard Setting iterative process in SIPS
Figure 2.2: Validity evidence of linkage of examinations/test results to the CEFR
Figure 2.3: Visual representation of procedures to relate examinations to the CEFR
Figure 2.4: Model for linking a test to the CEFR
Figure 2.5: Steps in the alignment process
Figure 3.1: Overview of Study A F2F standard setting & benchmarking workshop
Figure 3.2: ID Matching method procedures
Figure 4.1: Multi-phase mixed-methods design
Figure 4.2: Overview of the study
Figure 4.3: Virtual workshop snapshot
Figure 4.4: Structure of focus group interviews
Figure 4.5: Online OIB example page
Figure 4.6: Study B item map example
Figure 4.7: Coding methods summary (focus group data)
Figure 5.1: Orientation & Training evaluation, Study A (N = 12) & Study B (N = 9)
Figure 5.2: Reading phase evaluation, Study A (n = 11) & Study B (n = 7)
Figure 5.3: Reading-into-writing phase evaluation, Study A (n = 10) & Study B (n = 6)
Figure 6.1: CEFR judgment agreement on common scripts, Study A (F2F, N = 11)
Figure 6.2: CEFR judgment agreement on common scripts, Study B (virtual, N = 9)
Figure 7.1: CEFR judgment agreement on common Reading items, Study A (F2F, N = 12)
Figure 7.2: CEFR judgment agreement on common Reading items, Study B (Virtual, N = 9)
Figure 7.3: Use of CEFR scales in F2F and virtual workshops (Reading)
Figure 8.1: Framework for calculating threshold region(s) and cut score(s)
Figure 9.1: An overview of the themes and codes
Figure 9.2: An overview of the hierarchy of the themes and codes
Figure 9.3: Cluster analysis on word similarity
Figure 9.4: Word tree rationalizing B2 judgments in the virtual environment
Figure 9.5: Word tree around “text” discussion in the virtual environment
Figure 9.6: Conceptual mapping of the discussion to the CEFR scales
Figure 9.7: Conceptual mapping of the discussion to the CEFR-level scales
Figure 10.1: Model for a CEFR linking study with an item-mapping method
Figure 10.2: Structure of the unified alignment and test design (UATD) process
Figure 12.1: Monitoring panelist engagement
Tables
Table 2.1: Hypothetical illustration of a threshold region in the ID Matching method
Table 4.1: Standard setting agenda for the virtual workshop, Study B
Table 4.2: Expanded Cizek & Earnest (2016) evaluation framework
Table 4.3: Materials and instruments used in Studies A (F2F) & B (Virtual)
Table 4.4: Examples of evaluation questionnaire modifications
Table 4.5: Summary of data collected in Study A
Table 4.6: Summary of data collected in Study B
Table 4.7: Summary of data collected in Study C
Table 4.8: Data collected—Reading
Table 4.9: Data collected—Reading-into-writing task
Table 4.10: Panelist judgments of scripts—Reading-into-writing
Table 4.11: Coding CEFR-level judgments to numeric values
Table 4.12: Data collected & analyses overview
Table 5.1: Influence on Reading standard setting judgments
Table 5.2: Influence on Reading-into-writing standard setting judgments
Table 6.1: Descriptor frequency in Reading-into-writing task
Table 6.2: Inter-panelist agreement & consistency: Reading-into-writing
Table 6.3: Judgment variance (Study A—Writing, N = 11)
Table 6.4: Judgment variance (Study B—Writing, N = 9)
Table 6.5: Inter-panelist agreement & consistency indices (Reading-into-writing)
Table 6.6: Panelist unexpected responses
Table 6.7: Intra-panelist agreement & consistency (Reading-into-writing)
Table 6.8: Summary of fit statistics for Reading-into-writing
Table 6.9: Mean severity: Externals vs internals (Reading-into-writing, R1)
Table 6.10: Pairwise interactions on scripts (internals vs externals, Study A, F2F, N = 11)
Table 6.11: Pairwise interactions on scripts (internals vs externals, Study B, Virtual, N = 9)
Table 6.12: CEFR judgments for Script 5 in both environments
Table 6.13: Evaluating the Reading-into-writing recommended cut scores (N = 1,111)
Table 6.14: Mean severity (F2F vs Virtual), Reading-into-writing
Table 6.15: Pairwise interaction between mode and written scripts, Study A & B
Table 6.16: Pairwise interaction between environment and panelist
Table 6.17: Final average CEFR-level judgments in the F2F and virtual workshop
Table 7.1: Inter-panelist agreement & consistency on holistic CEFR item judgments (Reading)
Table 7.2: Inter-panelist agreement & consistency on analytic CEFR Reading item judgments
Table 7.3: Summary of inter-panelist agreement & consistency indices (Reading)
Table 7.4: Intra-panelist consistency between empirical data and Reading judgments, MPI
Table 7.5: Intra-panelist agreement (holistic & analytic), Study A (F2F), Reading (N = 12)
Table 7.6: Intra-panelist agreement (holistic & analytic), Study B (virtual), Reading (N = 9)
Table 7.7: Intra-panelist reliability (holistic) Reading, Study B (virtual, N = 9)
Table 7.8: Summary of fit statistics
Table 7.9: Comparing the severity of the two panelist subgroups in the F2F & virtual workshops
Table 7.10: Pairwise interactions (subgroups & tasks, F2F vs Virtual)
Table 7.11: Cut score locations, Study A (F2F, N = 12)
Table 7.12: Cut score locations, Study B (N = 9)
Table 7.13: Evaluating the error in Reading cut scores, Study A (N = 12)
Table 7.14: Evaluating the error in Reading cut scores, Study B (N = 9)
Table 7.15: Evaluation of recommended Reading cut scores (N = 1,109)
Table 7.16: Panel severity on common Reading items (F2F vs virtual panels)
Table 7.17: DGF analysis on common Reading items (F2F vs virtual panels)
Table 7.18: Panelist pairwise interactions on common Reading items (F2F vs virtual)
Table 7.19: CEFR judgments on common Reading items
Table 7.20: Test-taker classification (N = 1,109)
Table 8.1: Coefficients from linear regression analysis (n = 1,103)
Table 8.2: Distance of cut scores from population mean (logit & RS)
Table 8.3: Item clusters via Wald statistics
Table 8.4: Predictive power of item clusters (n = 1,103)
Table 8.5: Calculated cut score locations (N = 1,109)
Table 8.6: DA & DC of calculated cut scores (N = 1,109)
Table 8.7: Test-taker classification on calculated cut scores (N = 1,109)
Table 9.1: Panelist affiliation & experience (Study C, Virtual, N = 9)
Table 9.2: Intercoder agreement
Table 9.3: Coding scheme (theme 1/RQ6)
Table 9.4: Coding scheme (theme 2/RQ6.1)
Table 9.5: Coding scheme (theme 3/RQ6.1)
Table 9.6: Coding scheme (theme 4/RQ6.1)
Table 9.7: Coding scheme (theme 5/RQ6.1)
Table 9.8: Coding scheme (theme 6/RQ6.2)
Table 9.9: Coding scheme (theme 7/RQ6.3)
Table 9.10: Relationship of sources in cluster analysis
Abbreviations
ACJ Adaptive comparative judgment
ALDs Achievement Level Descriptors
ALTE Association of Language Testers in Europe
AO Awarding Organization
BoW Body of Work
BSSP Bookmark Standard Setting Procedure
CEFR Common European Framework of Reference for Languages
CJ Comparative judgment
CI (LL, UL) Confidence Interval lower limit, upper limit
CLT Central Limit Theorem
CREL Conditional reliability
CS Cut score
CSEM Conditional standard error of measurement
CTT Classical Test Theory
CW Creative Writing
DA Decision accuracy
DGF Differential group functioning
DIF Differential item functioning
DPF Differential panelist functioning
DPJ Dominant Profile Judgment
EALTA European Association for Language Testing and Assessment
ESS Embedded Standard Setting
ESSA Every Student Succeeds Act
ETS Educational Testing Service
FGs Focus groups
GEPT General English Proficiency Tests
ICC Intraclass Correlation Coefficient
IELTS International English Language Testing System
IRT Item Response Theory
ISE Integrated Skills in English
JPC Judgment policy capturing
KSA(s) Knowledge, skills, and abilities
KWIC Key word in context
LL Livingston and Lewis
LTA Language testing and assessment
MAPT Massachusetts Adult Proficiency Tests
MCC Minimally competent candidate
MFRM Many-Facet Rasch Measurement
MH Mantel-Haenszel
MPI Misplacement Index
MSPAP Maryland School Performance Assessment Program
NAEP National Assessment of Educational Progress
NAGB National Assessment Governing Board
NCLB No Child Left Behind
NDA Non-disclosure agreement
OIB Ordered item booklet
OPB Ordered profile booklet
ORC Overall Reading Comprehension
OSS Objective Standard Setting
OWP Overall Written Production
PADDI Principled assessment design, development, and implementation
Details
- Pages
- 432
- Publication Year
- 2026
- ISBN (PDF)
- 9783631921678
- ISBN (ePUB)
- 9783631921685
- ISBN (Hardcover)
- 9783631921661
- DOI
- 10.3726/b23366
- Open Access
- CC-BY
- Language
- English
- Publication date
- 2026 (April)
- Keywords
- CEFR, standard setting, cut scores, evaluation framework, Unified Alignment & Test Development (UATD), principled cut score approach, virtual workshops, face-to-face workshops, comparative study, standard setting theory, language testing
- Published
- Berlin, Bruxelles, Chennai, Lausanne, New York, Oxford, 2026. 432 pp., 29 fig. col., 15 fig. b/w, 84 tables.
- Product Safety
- Peter Lang Group AG