Virtual Standard Setting: Setting Cut Scores

by Charalambos Kollias (Author)
©2023 Monographs 302 Pages
Series: Language Testing and Evaluation, Volume 46

Summary

Virtual standard setting has grown in popularity since the global outbreak of Covid-19 in 2020, as standard setting practitioners needed to conduct cut score studies and/or linking studies online. The research presented in this book predates Covid-19 and explores virtual standard setting in two e-communication media (audio and video), then compares them to the face-to-face environment. The interplay of quantitative methods [i.e., classical test theory (CTT) and Rasch measurement theory (RMT)] and qualitative methods [i.e., the constant comparative method (CCM) and media naturalness theory (MNT)] offers Ariadne’s thread through the labyrinth of virtual standard setting. Illustrative examples of how to conduct and evaluate a virtual workshop are offered to encourage standard setting practitioners to embrace the opportunities of the virtual environment.

Table Of Contents

  • Cover
  • Title
  • Copyright
  • About the author
  • About the book
  • This eBook can be cited
  • Acknowledgements
  • Table of contents
  • List of figures
  • List of tables
  • List of acronyms
  • Chapter 1: Introduction
  • 1.1 Overview of the study
  • 1.2 Scope of the study
  • 1.3 Outline of the chapters
  • Chapter 2: Literature review
  • 2.1 Background to standard setting
  • 2.2 The importance of setting valid cut scores
  • 2.2.1 Standard setting methods
  • 2.2.1.1 Examples of test-centred methods
  • Variants of the Angoff method
  • The Bookmark method
  • The Objective Standard Setting (OSS) method
  • 2.2.1.2 Examples of examinee-centred methods
  • The Borderline Group (BG) method and the Contrasting Group (CG) method
  • The Body of Work (BoW) method
  • 2.2.2 Evaluating and validating standard setting methods
  • 2.3 Standard setting in language assessment
  • 2.3.1 Current LTA standard setting research
  • 2.3.1.1 The first publicly available CEFR alignment studies
  • 2.3.1.2 Studies investigating understanding of method or CEFR
  • 2.3.1.3 Studies investigating external validity evidence
  • 2.3.1.4 Studies proposing new methods/modifications
  • 2.4 Challenges associated with standard setting
  • 2.4.1 Theoretical and practical challenges
  • 2.4.2 Logistics
  • 2.5 Virtual standard setting
  • 2.5.1 Virtual standard setting: Empirical studies
  • 2.5.2 Challenges associated with virtual standard setting
  • 2.6 Media naturalness theory
  • 2.6.1 Re-evaluating virtual standard setting studies through MNT
  • 2.7 Summary
  • Chapter 3: Methodology
  • 3.1 Research aim and questions
  • 3.2 Methods
  • 3.2.1 Embedded MMR design
  • 3.2.2 Counterbalanced workshop design
  • 3.2.3 Instruments
  • 3.2.3.1 Web-conferencing platform and data collection platform
  • 3.2.3.2 Test instrument
  • 3.2.3.3 CEFR familiarisation verification activities
  • 3.2.3.4 Recruiting participants
  • 3.2.3.5 Workshop surveys
  • 3.2.3.6 Focus group interviews
  • 3.2.3.7 Ethical considerations
  • 3.3 Standard setting methodology
  • 3.3.1 Rationale for the Yes/No Angoff method
  • 3.3.2 Pre-workshop platform training
  • 3.3.3 In preparation for the virtual workshop
  • 3.3.4 Description of the workshop stages
  • 3.3.4.1 Introduction stage
  • 3.3.4.2 Orientation stage
  • 3.3.4.2.1 CEFR familiarisation verification activity A
  • 3.3.4.2.2 CEFR familiarisation verification activity B
  • 3.3.4.2.3 Familiarisation with the test instrument
  • 3.3.4.3 Method training stage
  • 3.3.4.4 Judgement stage
  • Round 1 Stage:
  • Round 2 Stage:
  • Round 3 Stage:
  • 3.4 Data analysis methods and frameworks
  • 3.4.1 CEFR verification activities analysis
  • 3.4.2 Internal validity of cut scores
  • Classical test theory (CTT)
  • Rasch measurement theory (RMT)
  • The many-facet Rasch measurement (MFRM) model
  • 3.4.3 Comparability of virtual cut score measures
  • 3.4.4 Differential severity
  • 3.4.5 Survey analysis
  • 3.4.6 Focus group interview analysis
  • 3.6 Summary
  • Chapter 4: Cut score data analysis
  • 4.1 Cut score internal validation: MFRM analysis
  • 4.1.1 Rasch group level indices
  • 4.1.2 Judge level indices
  • 4.2 Cut score internal validation: CTT analysis
  • 4.2.1 Consistency within the method
  • 4.2.2 Intraparticipant consistency
  • 4.2.3 Interparticipant consistency
  • 4.2.4 Decision consistency and accuracy
  • The Livingston and Lewis method:
  • The Standard Error method
  • 4.3 Comparability of cut scores between media and environments
  • 4.3.1 Comparability of virtual cut score measures
  • 4.3.2 Comparability of virtual and F2F cut score measures
  • 4.4 Differential severity between medium, judges, and panels
  • 4.4.1 Differential judge functioning (DJF)
  • 4.4.2 Differential medium functioning (DMF)
  • 4.4.3 Differential group functioning (DGF)
  • 4.5 Summary
  • Chapter 5: Survey data analysis
  • 5.1 Survey instruments
  • 5.2 Perception survey instrument
  • 5.2.1 Evaluating the perception survey instruments
  • 5.2.2 Analysis of perception survey items
  • Qualitative comments for communication item 1:
  • Audio medium
  • Video medium
  • Qualitative comments for communication item 2:
  • Audio medium
  • Video medium
  • Qualitative comments for communication item 3:
  • Audio medium
  • Video medium
  • Qualitative comments for communication item 4:
  • Qualitative comments for communication item 5:
  • Audio medium
  • Video medium
  • Qualitative comments for communication item 6:
  • Audio medium
  • Video medium
  • Qualitative comments for communication item 7:
  • Audio medium
  • Video medium
  • Qualitative comments for communication item 8:
  • Audio medium
  • Video medium
  • Qualitative comments for communication item 9:
  • Audio medium
  • Video medium
  • 5.3 Procedural survey items
  • 5.3.1 Evaluating the procedural survey instruments
  • 5.4 Summary
  • Chapter 6: Focus group interview data analysis
  • 6.1 Analysis of transcripts
  • 6.2 Findings
  • 6.2.1 Psychological aspects
  • Distraction in the video medium
  • Self-consciousness in the video medium
  • Lack of non-verbal feedback in the audio medium
  • Inability to distinguish speaker in the audio medium
  • Inability to discern who was paying attention in audio medium
  • Cognitive strain in the audio medium
  • 6.2.2 Interaction
  • Lack of small talk in virtual environments
  • No digression from the topic in virtual environments
  • Differences in amounts of discussion between virtual and F2F settings
  • 6.2.3 Technical aspects
  • Technical problems in virtual environments
  • Turn-taking system
  • 6.2.4 Convenience
  • Time saved in virtual environments
  • Freedom to multi-task in virtual environments
  • Less fatigue in virtual environments
  • 6.2.5 Decision-making in virtual environments
  • 6.3 Summary
  • Chapter 7: Integration and discussion of findings
  • 7.1 Research questions
  • 7.1.1 Research questions 1, 2, and 3
  • 7.1.2 Research question 4
  • 7.1.3 Research question 5
  • 7.2 Limitations
  • 7.3 Summary
  • Chapter 8: Implications, future research, and conclusion
  • 8.1 Significance and contribution to the field
  • 8.2 Guidance for conducting synchronous virtual cut score studies
  • Demands for facilitators and/or co-facilitators
  • Establishing a virtual standard setting netiquette
  • Selecting a suitable virtual platform
  • Selecting an appropriate medium for the workshop
  • Recruiting online participants
  • Training in the virtual platform
  • Uploading materials
  • Monitoring progress and engaging judges
  • 8.3 Recommendations for future research
  • 8.4 Concluding remarks
  • Appendices
  • Appendix A CEFR verification activity A (Key)
  • Appendix B Electronic consent form
  • Appendix C Judge background questionnaire
  • Appendix D Focus group protocol
  • Introductory statement
  • Focus group interview questions
  • Appendix E Facilitator’s virtual standard setting protocol
  • Appendix F CEFR familiarisation verification activity results
  • Appendix G: Facets specification file
  • Appendix H: Intraparticipant consistency indices
  • Appendix I: Group 5 group level and individual level Rasch indices
  • Appendix J: Form A & Form B score tables
  • Appendix K: DJF pairwise interactions
  • Appendix L: DGF pairwise interactions
  • Appendix M: Wright maps
  • References
  • Author index
  • Subject index
  • Series index

List of figures

Figure 2.1 The media naturalness scale

Figure 3.1 The study’s embedded MMR design

Figure 3.2 Overview of counterbalanced virtual workshop design

Figure 3.3 The e-platform placed on the media naturalness scale

Figure 3.4 CEFR familiarisation verification activities

Figure 3.5 Surveys administered to each panel during each workshop

Figure 3.6 Focus group sessions

Figure 3.7 Example of e-platform: equipment check session

Figure 3.8 Example of e-platform: audio medium session

Figure 3.9 Example of e-platform: video medium session

Figure 3.10 Overview of the workshop stages for each session

Figure 3.11 Example of CEFR familiarisation verification activity A

Figure 3.12 Example of CEFR familiarisation verification activity B

Figure 3.13 Example of CEFR familiarisation verification activity feedback 1

Figure 3.14 Example of CEFR familiarisation verification activity feedback 2

Figure 3.15 Example of grammar subsection familiarisation

Figure 3.16 Example of Round 1 virtual rating form

Figure 3.17 Example of panellist normative information feedback

Figure 3.18 Example of Round 2 virtual rating form

Figure 3.19 Group 1 normative information and consequences feedback

Figure 3.20 Round 3 virtual rating form

Figure 3.21 Overview of the quantitative and qualitative data collected

Figure 3.22 Data analysis for internal validity: CTT

Figure 3.23 Data analysis for internal validity: RMT

Figure 3.24 CCM process for analysing focus group transcripts

Figure 3.25 Coding process within CCM

List of tables

Table 2.1 Summary of elements for evaluating standard setting

Table 2.2 Summary of standard setting expenses

Table 3.1 BCCETM GVR section: Original vs. shortened versions

Table 3.2 Summary of workshop participants

Table 3.3 Examples of survey adaptations

Table 3.4 Materials uploaded onto virtual platforms

Table 3.5 Virtual session duration

Table 3.6 Overview of RQs, instruments, data collected, and analysis

Table 4.1 Group 1 group level Rasch indices

Table 4.2 Group 2 group level Rasch indices

Table 4.3 Group 3 group level Rasch indices

Table 4.4 Group 4 group level Rasch indices

Table 4.5 Group 1 individual level Rasch indices

Table 4.6 Group 2 individual level Rasch indices

Table 4.7 Group 3 individual level Rasch indices

Table 4.8 Group 4 individual level Rasch indices

Table 4.9 Psychometric characteristics of Test Form A and Test Form B

Table 4.10 All groups internal consistency within method check

Table 4.11 Intraparticipant consistency indices per round and test form

Table 4.12 Changes in ratings across Round 1 and Round 2

Table 4.13 Logit changes in ratings across Round 2 and Round 3

Table 4.14 Interparticipant indices: Form A

Table 4.15 Interparticipant indices: Form B

Table 4.16 Accuracy and consistency estimates for Form A raw cut scores

Table 4.17 Accuracy and consistency estimates for Form B raw cut scores

Table 4.18 Form A and Form B pass/fail rates

Table 4.19 Percentage of correct classifications per group and test form

Table 4.20 Round 1 virtual cut score measure comparisons

Table 4.21 Round 2 virtual cut score measure comparisons

Table 4.22 Round 3 virtual cut score measure comparisons

Table 4.23 Round 1 virtual and F2F cut score measure comparisons

Table 4.24 Round 2 virtual and F2F cut score measure comparisons

Table 4.25 Round 3 virtual & Round 2 F2F cut score measure comparisons

Table 4.26 DMF analysis of all judgements per medium

Table 4.27 DMF analysis of all judgements per medium, within test form

Table 4.28 DGF analysis across all judgements between media per group

Table 4.29 Round 1 DGF pairwise interactions within groups

Table 4.30 Round 2 DGF pairwise interactions

Table 4.31 Round 3 DGF pairwise interactions

Table 5.1 Psychometric characteristics of perception survey instruments

Table 5.2 Frequency data of the perception survey instruments

Table 5.3 Wilcoxon signed-rank test/ Sign test communication item 1

Table 5.4 Wilcoxon signed-rank test/ Sign test communication item 2

Table 5.5 Wilcoxon signed-rank test/ Sign test communication item 3

Table 5.6 Wilcoxon signed-rank test/ Sign test communication item 4

Table 5.7 Wilcoxon signed-rank test/ Sign test communication item 5

Table 5.8 Wilcoxon signed-rank test/ Sign test communication item 6

Table 5.9 Wilcoxon signed-rank test/ Sign test communication item 7

Table 5.10 Wilcoxon signed-rank test/ Sign test communication item 8

Table 5.11 Wilcoxon signed-rank test/ Sign test communication item 9

Table 5.12 Wilcoxon signed-rank test/ Sign test communication item 10

Table 5.13 Wilcoxon signed-rank test/ Sign test communication item 11

Table 5.14 Wilcoxon signed-rank test/ Sign test platform item 1

Table 5.15 Wilcoxon signed-rank test/ Sign test platform item 2

Table 5.16 Psychometric characteristics of procedural survey instruments

Table 5.17 Frequency data of procedural survey instruments

Table 6.1 Coding scheme

Table 8.1 Virtual standard setting platform framework

Chapter 1: Introduction

The purpose of this chapter is to provide a broad introduction to the study. The chapter is divided into three main sections: the first provides an overview of the study, the second discusses its scope, and the final section outlines the structure of the chapters.

1.1 Overview of the study

The overall aim of the study was to further investigate virtual standard setting by examining the feasibility of replicating a F2F standard setting workshop, conducted in 2011, in two virtual environments, audio-only (henceforth “audio”) and audio-visual (henceforth “video”), and to explore factors that may impact cut scores. First, standard setting, as used in the study, is defined and the practical challenges associated with it are presented. Next, an overview of the findings from the few empirical virtual standard setting studies conducted to date is presented, and areas of virtual standard setting that warrant further investigation are discussed. Finally, the rationale for the study, along with the contributions it sought to make, is presented.

Standard setting is a decision-making process of setting a cut score – a certain point on a test scale used for classifying test takers into at least two different categories (Cizek, Bunch, & Koons, 2004; Hambleton & Eignor, 1978; Kaftandjieva, 2010). The standard setting process usually entails recruiting a group of panellists to complete a variety of tasks with the aim of recommending a cut score, which usually equates to a pass/fail decision on a certain test instrument. The key challenges associated with conducting a standard setting workshop range from purely academic issues, such as selecting the most appropriate method to set cut scores, to very practical issues, such as recruiting panellists and arranging accommodation. It is such practical issues that may result in workshops either not being replicated at regular intervals to examine whether cut scores have changed (Dunlea & Figueras, 2012) or, in some cases, not being conducted at all (Tannenbaum, 2013).
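
As a concrete illustration of what a cut score does, the sketch below aggregates invented Yes/No Angoff judgements (the method adopted later in this study; see Section 3.3.1) into a recommended raw cut score and applies it as a pass/fail classification. The aggregation rule shown (mean of the judges’ “Yes” counts) is the commonly described one and all data are fabricated; this is a minimal sketch, not the study’s actual procedure.

```python
# Minimal sketch: aggregating Yes/No Angoff judgements into a cut score.
# Each judge marks, per item, whether a borderline (minimally competent)
# test taker would answer the item correctly: 1 = Yes, 0 = No.
ratings = {
    "judge_1": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "judge_2": [1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
    "judge_3": [0, 1, 1, 1, 0, 1, 0, 1, 1, 1],
}

# A judge's individual cut score is the number of items the borderline
# candidate is expected to answer correctly; the panel's recommended raw
# cut score is commonly taken as the mean of the individual cut scores.
individual_cuts = {judge: sum(r) for judge, r in ratings.items()}
panel_cut = sum(individual_cuts.values()) / len(individual_cuts)

def classify(raw_score: int, cut: float) -> str:
    """Apply the cut score to place a test taker into one of two categories."""
    return "pass" if raw_score >= cut else "fail"

print(individual_cuts)         # {'judge_1': 7, 'judge_2': 6, 'judge_3': 7}
print(round(panel_cut, 2))     # 6.67
print(classify(8, panel_cut))  # pass
print(classify(6, panel_cut))  # fail
```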

Recruiting panellists for a standard setting workshop places a heavy financial burden on the awarding body commissioning the cut score study. The external costs associated with conducting such a study usually entail hiring a suitable venue, offering panellists a financial incentive for participating in the study (per diem or lump sum) and, when panellists are associated with a university, paying the university a sum for contracting their lecturers. Furthermore, when an awarding body has limited human resources, it may need to hire temporary staff to help with the amount of preparation needed to conduct the workshop. For example, a large volume of photocopies needs to be made so that all panellists have their own sets of materials (i.e., training materials, the test instrument, rating forms, etc.) to be used during the study. In cases where the awarding body cannot conduct the cut score study itself, standard setting practitioners need to be contracted for the study. There are also internal costs associated with standard setting, such as internal meetings held amongst staff to organise the cut score studies, follow-up meetings to discuss the recommended cut scores and their implications, and even the write-up of the cut score study itself. In some studies, qualified internal staff may participate as panellists in the standard setting sessions to reduce the external costs. The time that internal staff devote to these activities is time taken from their everyday duties and responsibilities, which usually results in a backlog of work.

Some standard setting practitioners (Harvey & Way, 1999; Katz, Tannenbaum, & Kannan, 2009; Schnipke & Becker, 2007) have started exploring the feasibility of setting cut scores in virtual environments to offset the external costs associated with F2F standard setting. Virtual environments here are defined as artificial environments in which geographically isolated participants engage in computer-mediated conversation with one another through e-communication tools (i.e., emails, audio-conferencing, and videoconferencing). The very few empirical virtual standard setting studies published to date (Harvey & Way, 1999; Katz & Tannenbaum, 2014; Katz, Tannenbaum, & Kannan, 2009) have confirmed that it is feasible to conduct a standard setting workshop in (1) an asynchronous virtual environment, in which panellists are not necessarily in the virtual environment at the same time, or (2) a combined synchronous and asynchronous environment, in which one or more parts of a cut score study are conducted in real time while other parts are conducted offline. These studies have also revealed that virtual standard setting can be conducted through different e-communication media, such as emails, audio-conferencing and/or conference calls, and even a combination of audio-conferencing and videoconferencing. While such findings paint a positive picture of virtual standard setting, it remains an under-investigated area of standard setting.

The empirical virtual standard setting studies published to date have been conducted in a series of smaller sessions. However, in a F2F setting, the duration of a cut score study on a language examination may range from approximately 1 to 1.5 days, when a cut score is to be set on a single instrument measuring a single skill (e.g., listening, reading, writing, or speaking), to as many as eight days, when multiple cut scores need to be set on multiple instruments. The feasibility of virtual sessions of this length has yet to be investigated. The demands placed both on the panellists’ equipment (i.e., computers, cameras, microphones, bandwidth, etc.) and on the panellists themselves (e.g., fatigue, motivation, distractions, etc.) may be too great, resulting in some participants withdrawing from the study or the study itself not being completed.

Little is known about whether an appropriate e-communication medium for conducting a virtual standard setting study exists and, if so, how a standard setting workshop might best be conducted within that medium. None of the published virtual standard setting studies have compared two different e-communication media (i.e., audio-conferencing and videoconferencing) to explore whether they result in comparable and/or equally reliable cut scores. What is also unclear is to what degree the virtual medium can affect panellists’ decision-making processes and/or their perceptions and evaluations of the virtual environment. A related issue is how such perceptions are to be evaluated. The standard setting literature provides specific guidance for conducting cut score studies and evaluating cut scores (Cizek & Earnest, 2016; Council of Europe, 2009; Kaftandjieva, 2004; Kane, 2001; Pitoniak, 2003; Zieky, Perie, & Livingston, 2008); however, translating this guidance to the virtual environment requires further exploration.

1.2 Scope of the study

This study sought to address the gap that exists in the virtual standard setting literature. Its aim was threefold. The first aim was to investigate whether one e-communication medium (audio or video) was more appropriate than the other when replicating a F2F standard setting workshop. This aim was addressed by (1) selecting a web-conferencing platform that could be used for both audio-conferencing and videoconferencing and (2) recruiting four groups of panellists to participate in two synchronous virtual sessions, each lasting approximately six hours (with breaks).

The second aim was to investigate whether the cut scores set via the two e-communication media (audio and video) were reliable and comparable, and as such would allow valid inferences to be drawn for cut score interpretations, and whether the virtual cut scores were comparable with previously set F2F cut scores. This aim was addressed through an embedded mixed methods, counterbalanced research design. To explore the comparability of the virtual cut scores between and across panels and media, two similar test instruments previously equated through Rasch measurement were used. The reliability and internal validity of the virtual cut scores were investigated by applying Kane’s framework (Kane, 2001). The virtual cut scores were also compared with cut scores previously set on the same test instruments in a F2F environment.
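
For readers unfamiliar with the equating step, the dichotomous Rasch model that underlies it is shown below. This is the standard textbook formulation, not a formula reproduced from the book: person ability and item difficulty are placed on a common logit scale, so two test forms can be equated by anchoring shared or previously calibrated item difficulties, which is what allows cut scores set on the two forms to be compared directly.

```latex
% Dichotomous Rasch model: the probability that person n answers item i
% correctly, given person ability \theta_n and item difficulty \delta_i
% (both expressed in logits on a common scale).
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
```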

The third aim was to explore whether either of the e-communication media (audio and video) affected the panellists’ decision-making processes as well as the panellists’ perceptions and evaluations of how well they communicated in each medium. This aim was investigated quantitatively through an analysis of survey data and qualitatively through an analysis of open-ended survey questions and focus group transcripts. The quantitative and qualitative findings were integrated and discussed with reference to media naturalness theory (MNT) (Kock, 2004, 2005, 2010) to gain new insights into virtual standard setting.
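
To illustrate the kind of paired quantitative comparison reported later (Tables 5.3 to 5.15 list Wilcoxon signed-rank and sign tests per survey item), the sketch below runs both tests on fabricated paired Likert ratings from the same panellists under the two media. It is an assumption that the study’s analysis took exactly this form; the data and variable names are invented.

```python
# Illustrative paired comparison of one perception survey item, rated by
# the same panellists under the audio and the video medium (data invented).
from scipy.stats import wilcoxon, binomtest

audio = [4, 3, 5, 2, 4, 3, 4, 5, 3, 4]  # e.g. 5-point Likert ratings
video = [5, 4, 5, 4, 4, 4, 5, 5, 4, 3]

# Wilcoxon signed-rank test on the paired differences.
result = wilcoxon(audio, video)
print(f"Wilcoxon: W = {result.statistic}, p = {result.pvalue:.3f}")

# Sign test: a binomial test on the direction of the non-zero differences.
diffs = [v - a for a, v in zip(audio, video) if v != a]
n_positive = sum(d > 0 for d in diffs)
sign = binomtest(n_positive, n=len(diffs), p=0.5)
print(f"Sign test: {n_positive}/{len(diffs)} positive differences, "
      f"p = {sign.pvalue:.3f}")
```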

The study sought to contribute to the limited research on virtual standard setting in three ways: (1) theoretical; (2) practical; and (3) methodological. The first contribution was (i) to provide evidence of the feasibility of conducting a synchronous virtual standard setting study simulating F2F conditions and (ii) to test a theoretical framework, drawing on the principles of MNT, for evaluating qualitative data collected from virtual standard setting panellists. The next contribution was to provide a practical framework for conducting virtual standard setting by offering guidance to standard setting practitioners. The final contribution was to provide a methodological framework for analysing multiple-panel cut scores through equating and anchoring test instruments to their respective difficulty levels. The study also added to the scarce literature on evaluating cut score data through MFRM (Eckes, 2009; Eckes, 2011/2015; Hsieh, 2013; Kaliski et al., 2012).
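
To indicate what an MFRM analysis of cut score data involves, a generic three-facet formulation for dichotomous Yes/No judgements is sketched below. The facet structure (judges, items, media) is a plausible reading of this study’s design, not the specification actually used (the study’s own Facets specification file is reproduced in Appendix G):

```latex
% Generic three-facet Rasch model for dichotomous Yes/No judgements:
% P_{ijm} is the probability that judge j rates item i "Yes" in medium m.
\log\!\left( \frac{P_{ijm}}{1 - P_{ijm}} \right) = \beta_j - \delta_i - \eta_m
% \beta_j : leniency of judge j
% \delta_i: difficulty of item i
% \eta_m : severity associated with medium m (audio vs. video)
```

In a model of this kind, the medium facet quantifies severity differences between audio and video directly, which is the logic behind the differential medium functioning (DMF) analyses reported in Chapter 4.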

1.3 Outline of the chapters

This study is presented in eight chapters. Chapter 1 provides the introduction, while Chapter 2 reviews the literature, with a particular focus on conducting standard setting in virtual environments. First, standard setting is defined in relation to norm-referenced and criterion-referenced test score interpretations and then defined, for the purpose of this study, as a decision-making activity. Second, the importance of standard setting is described, key elements of its evaluation are discussed, and examples of standard setting methods are presented. Third, the role of standard setting in the field of language testing and assessment (LTA) is discussed and current standard setting research is presented. Fourth, the challenges associated with conducting F2F standard setting are discussed. Next, the limited number of virtual standard setting studies reported to date are critically evaluated and the challenges associated with conducting virtual standard setting are presented. Finally, MNT is presented, and the virtual standard setting studies are re-evaluated through its principles to identify the gap in the research literature.

Details

Pages: 302
Year: 2023
ISBN (PDF): 9783631889046
ISBN (ePUB): 9783631889053
ISBN (Hardcover): 9783631805398
DOI: 10.3726/b20407
Language: English
Publication date: 2023 (February)
Published: Berlin, Bern, Bruxelles, New York, Oxford, Warszawa, Wien, 2023. 302 pp., 2 fig. col., 24 fig. b/w, 58 tables

Biographical notes

Charalambos Kollias (Author)

Charalambos Kollias is a research director – psychometrician at an educational research organisation. He has worked in the field of language assessment for over 30 years in roles ranging from examiner trainer and assessment specialist to measurement analyst. He is experienced in conducting and evaluating (virtual) standard setting workshops and in linking local and international examinations to global frameworks.
