Virtual Standard Setting: Setting Cut Scores
Table of Contents
- Cover
- Title
- Copyright
- About the author
- About the book
- This eBook can be cited
- Acknowledgements
- Table of contents
- List of figures
- List of Tables
- List of acronyms
- Chapter 1: Introduction
- 1.1 Overview of the study
- 1.2 Scope of the study
- 1.3 Outline of the chapters
- Chapter 2: Literature review
- 2.1 Background to standard setting
- 2.2 The importance of setting valid cut scores
- 2.2.1 Standard setting methods
- 2.2.1.1 Examples of test-centred methods
- Variants of the Angoff method
- The Bookmark method
- The Objective Standard Setting (OSS) method
- 2.2.1.2 Examples of examinee-centred methods
- The Borderline Group (BG) method and the Contrasting Group (CG) method
- The Body of Work (BoW) method
- 2.2.2 Evaluating and validating standard setting methods
- 2.3 Standard setting in language assessment
- 2.3.1 Current LTA standard setting research
- 2.3.1.1 The first publicly available CEFR alignment studies
- 2.3.1.2 Studies investigating understanding of method or CEFR
- 2.3.1.3 Studies investigating external validity evidence
- 2.3.1.4 Studies proposing new methods/modifications
- 2.4 Challenges associated with standard setting
- 2.4.1 Theoretical and practical challenges
- 2.4.2 Logistics
- 2.5 Virtual standard setting
- 2.5.1 Virtual standard setting: Empirical studies
- 2.5.2 Challenges associated with virtual standard setting
- 2.6 Media naturalness theory
- 2.6.1 Re-evaluating virtual standard setting studies through MNT
- 2.7 Summary
- Chapter 3: Methodology
- 3.1 Research aim and questions
- 3.2 Methods
- 3.2.1 Embedded MMR design
- 3.2.2 Counterbalanced workshop design
- 3.2.3 Instruments
- 3.2.3.1 Web-conferencing platform and data collection platform
- 3.2.3.2 Test instrument
- 3.2.3.3 CEFR familiarisation verification activities
- 3.2.3.4 Recruiting participants
- 3.2.3.5 Workshop surveys
- 3.2.3.6 Focus group interviews
- 3.2.3.7 Ethical considerations
- 3.3 Standard setting methodology
- 3.3.1 Rationale for the Yes/No Angoff method
- 3.3.2 Pre-workshop platform training
- 3.3.3 In preparation for the virtual workshop
- 3.3.4 Description of the workshop stages
- 3.3.4.1 Introduction stage
- 3.3.4.2 Orientation stage
- 3.3.4.2.1 CEFR familiarisation verification activity A
- 3.3.4.2.2 CEFR familiarisation verification activity B
- 3.3.4.2.3 Familiarisation with the test instrument
- 3.3.4.3 Method training stage
- 3.3.4.4 Judgement stage
- Round 1 Stage:
- Round 2 Stage:
- Round 3 Stage:
- 3.4 Data analysis methods and frameworks
- 3.4.1 CEFR verification activities analysis
- 3.4.2 Internal validity of cut scores
- Classical test theory (CTT)
- Rasch measurement theory (RMT)
- The many-facet Rasch measurement (MFRM) model
- 3.4.3 Comparability of virtual cut score measures
- 3.4.4 Differential severity
- 3.4.5 Survey analysis
- 3.4.6 Focus group interview analysis
- 3.6 Summary
- Chapter 4: Cut score data analysis
- 4.1 Cut score internal validation: MFRM analysis
- 4.1.1 Rasch group level indices
- 4.1.2 Judge level indices
- 4.2 Cut score internal validation: CTT analysis
- 4.2.1 Consistency within the method
- 4.2.2 Intraparticipant consistency
- 4.2.3 Interparticipant consistency
- 4.2.4 Decision consistency and accuracy
- The Livingston and Lewis method:
- The Standard Error method
- 4.3 Comparability of cut scores between media and environments
- 4.3.1 Comparability of virtual cut score measures
- 4.3.2 Comparability of virtual and F2F cut score measures
- 4.4 Differential severity between medium, judges, and panels
- 4.4.1 Differential judge functioning (DJF)
- 4.4.2 Differential medium functioning (DMF)
- 4.4.3 Differential group functioning (DGF)
- 4.5 Summary
- Chapter 5: Survey data analysis
- 5.1 Survey instruments
- 5.2 Perception survey instrument
- 5.2.1 Evaluating the perception survey instruments
- 5.2.2 Analysis of perception survey items
- Qualitative comments for communication item 1:
- Audio medium
- Video medium
- Qualitative comments for communication item 2:
- Audio medium
- Video medium
- Qualitative comments for communication item 3:
- Audio medium
- Video medium
- Qualitative comments for communication item 4:
- Qualitative comments for communication item 5:
- Audio
- Video medium
- Qualitative comments for communication item 6:
- Audio medium
- Video medium
- Qualitative comments for communication item 7:
- Audio medium
- Video medium
- Qualitative comments for communication item 8:
- Audio medium
- Video medium
- Qualitative comments for communication item 9:
- Audio medium
- The video medium
- 5.3 Procedural survey items
- 5.3.1 Evaluating the procedural survey instruments
- 5.4 Summary
- Chapter 6: Focus group interview data analysis
- 6.1 Analysis of transcripts
- 6.2 Findings
- 6.2.1 Psychological aspects
- Distraction in the video medium
- Self-consciousness in the video medium
- Lack of non-verbal feedback in the audio medium
- Inability to distinguish speaker in the audio medium
- Inability to discern who was paying attention in audio medium
- Cognitive strain in the audio medium
- 6.2.2 Interaction
- Lack of small talk in virtual environments
- No digression from the topic in virtual environments
- Differences in amounts of discussion between virtual and F2F settings
- 6.2.3 Technical aspects
- Technical problems in virtual environments
- Turn-taking system
- 6.2.4 Convenience
- Time saved in virtual environments
- Freedom to multi-task in virtual environments
- Less fatigue in virtual environments
- 6.2.5 Decision-making in virtual environments
- 6.3 Summary
- Chapter 7: Integration and discussion of findings
- 7.1 Research questions
- 7.1.1 Research questions 1, 2, and 3
- 7.1.2 Research question 4
- 7.1.3 Research question 5
- 7.2 Limitations
- 7.3 Summary
- Chapter 8: Implications, future research, and conclusion
- 8.1 Significance and contribution to the field
- 8.2 Guidance for conducting synchronous virtual cut score studies
- Demands for facilitators and/or co-facilitators
- Establishing a virtual standard setting netiquette
- Selecting a suitable virtual platform
- Selecting an appropriate medium for the workshop
- Recruiting online participants
- Training in the virtual platform
- Uploading materials
- Monitoring progress and engaging judges
- 8.3 Recommendations for future research
- 8.4 Concluding remarks
- Appendices
- Appendix A CEFR verification activity A (Key)
- Appendix B Electronic consent form
- Appendix C Judge background questionnaire
- Appendix D Focus group protocol
- Introductory statement
- Focus group interview questions
- Appendix E Facilitator’s virtual standard setting protocol
- Appendix F CEFR familiarisation verification activity results
- Appendix G: Facets specification file
- Appendix H: Intraparticipant consistency indices
- Appendix I: Group 5 group level and individual level Rasch indices
- Appendix J: Form A & Form B score tables
- Appendix K: DJF pairwise interactions
- Appendix L: DGF pairwise interactions
- Appendix M: Wright maps
- References
- Author index
- Subject index
- Series index
List of figures
Figure 2.1 The media naturalness scale
Figure 3.1 The study’s embedded MMR design
Figure 3.2 Overview of counterbalanced virtual workshop design
Figure 3.3 The e-platform placed on the media naturalness scale
Figure 3.4 CEFR familiarisation verification activities
Figure 3.5 Surveys administered to each panel during each workshop
Figure 3.6 Focus group sessions
Figure 3.7 Example of e-platform: equipment check session
Figure 3.8 Example of e-platform: audio medium session
Figure 3.9 Example of e-platform: video medium session
Figure 3.10 Overview of the workshop stages for each session
Figure 3.11 Example of CEFR familiarisation verification activity A
Figure 3.12 Example of CEFR familiarisation verification activity B
Figure 3.13 Example of CEFR familiarisation verification activity feedback 1
Figure 3.14 Example of CEFR familiarisation verification activity feedback 2
Figure 3.15 Example of grammar subsection familiarisation
Figure 3.16 Example of Round 1 virtual rating form
Figure 3.17 Example of panellist normative information feedback
Figure 3.18 Example of Round 2 virtual rating form
Figure 3.19 Group 1 normative information and consequences feedback
Figure 3.20 Round 3 virtual rating form
Figure 3.21 Overview of the quantitative and qualitative data collected
Figure 3.22 Data analysis for internal validity: CTT
Figure 3.23 Data analysis for internal validity: RMT
Figure 3.24 CCM process for analysing focus group transcripts
List of tables
Table 2.1 Summary of elements for evaluating standard setting
Table 2.2 Summary of standard setting expenses
Table 3.1 BCCE™ GVR section: Original vs. shortened versions
Table 3.2 Summary of workshop participants
Table 3.3 Examples of survey adaptations
Table 3.4 Materials uploaded onto virtual platforms
Table 3.5 Virtual session duration
Table 3.6 Overview of RQs, instruments, data collected, and analysis
Table 4.1 Group 1 group level Rasch indices
Table 4.2 Group 2 group level Rasch indices
Table 4.3 Group 3 group level Rasch indices
Table 4.4 Group 4 group level Rasch indices
Table 4.5 Group 1 individual level Rasch indices
Table 4.6 Group 2 individual level Rasch indices
Table 4.7 Group 3 individual level Rasch indices
Table 4.8 Group 4 individual Rasch level indices
Table 4.9 Psychometric characteristics of Test Form A and Test Form B
Table 4.10 All groups internal consistency within method check
Table 4.11 Intraparticipant consistency indices per round and test form
Table 4.12 Changes in ratings across Round 1 and Round 2
Table 4.13 Logit changes in ratings across Round 2 and Round 3
Table 4.14 Interparticipant indices: Form A
Table 4.15 Interparticipant indices: Form B
Table 4.16 Accuracy and consistency estimates for Form A raw cut scores
Table 4.17 Accuracy and consistency estimates for Form B raw cut scores
Table 4.18 Form A and Form B pass/fail rates
Table 4.19 Percentage of correct classifications per group and test form
Table 4.20 Round 1 virtual cut score measure comparisons
Table 4.21 Round 2 virtual cut score measure comparisons
Table 4.22 Round 3 virtual cut score measure comparisons
Table 4.23 Round 1 virtual and F2F cut score measure comparisons
Table 4.24 Round 2 virtual and F2F cut score measure comparisons
Table 4.25 Round 3 virtual & Round 2 F2F cut score measure comparisons
Table 4.26 DMF analysis of all judgements per medium
Table 4.27 DMF analysis of all judgements per medium, within test form
Table 4.28 DGF analysis across all judgements between media per group
Table 4.29 Round 1 DGF pairwise interactions within groups
Table 4.30 Round 2 DGF pairwise interactions
Table 4.31 Round 3 DGF pairwise interactions
Table 5.1 Psychometric characteristics of perception survey instruments
Table 5.2 Frequency data of the perception survey instruments
Table 5.3 Wilcoxon signed-rank test/Sign test communication item 1
Table 5.4 Wilcoxon signed-rank test/Sign test communication item 2
Table 5.5 Wilcoxon signed-rank test/Sign test communication item 3
Table 5.6 Wilcoxon signed-rank test/Sign test communication item 4
Table 5.7 Wilcoxon signed-rank test/Sign test communication item 5
Table 5.8 Wilcoxon signed-rank test/Sign test communication item 6
Table 5.9 Wilcoxon signed-rank test/Sign test communication item 7
Table 5.10 Wilcoxon signed-rank test/Sign test communication item 8
Table 5.11 Wilcoxon signed-rank test/Sign test communication item 9
Table 5.12 Wilcoxon signed-rank test/Sign test communication item 10
Table 5.13 Wilcoxon signed-rank test/Sign test communication item 11
Table 5.14 Wilcoxon signed-rank test/Sign test platform item 1
Table 5.15 Wilcoxon signed-rank test/Sign test platform item 2
Table 5.16 Psychometric characteristics of procedural survey instruments
Chapter 1: Introduction
The purpose of this chapter is to provide a broad introduction to the study. The chapter is divided into three main sections: the first section provides an overview of the study, the next section discusses its scope, and the final section presents the structure of the study.
1.1 Overview of the study
The overall aim of the study was to further investigate virtual standard setting by examining the feasibility of replicating a F2F standard setting workshop conducted in 2011 in two virtual environments, audio-only (henceforth “audio”) and audio-visual (henceforth “video”), and to explore factors that may impact cut scores. First, standard setting, as used in the study, is defined and the practical challenges associated with it are presented. Next, an overview of the findings from the few empirical virtual standard setting studies conducted to date is presented, and areas of virtual standard setting which warrant further investigation are discussed. Finally, the rationale for the study, along with the contributions it sought to make, is presented.
Standard setting is a decision-making process of setting a cut score – a certain point on a test scale used for classifying test takers into at least two different categories (Cizek, Bunch, & Koons, 2004; Hambleton & Eignor, 1978; Kaftandjieva, 2010). The standard setting process usually entails recruiting a group of panellists to complete a variety of tasks with the aim of recommending a cut score, which usually equates to a pass/fail decision on a certain test instrument. The key challenges associated with conducting a standard setting workshop range from purely academic issues, such as selecting the most appropriate method to set cut scores, to very practical issues, such as recruiting panellists and arranging accommodation. These practical issues may result in such workshops either not being replicated at regular intervals (Dunlea & Figueras, 2012) to examine whether cut scores have changed or, in some cases, not being conducted at all (Tannenbaum, 2013).
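To make the classification role of a cut score concrete, the minimal sketch below (in Python) shows how a single recommended cut score maps raw scores onto pass/fail decisions. The cut score, score scale, and raw scores are invented for illustration and do not come from the study or its test instruments.

```python
# Minimal, hypothetical illustration of how a cut score classifies test takers.
# The cut score and raw scores below are invented purely for demonstration.

def classify(raw_score: int, cut_score: int) -> str:
    """Return 'pass' if the raw score meets or exceeds the cut score, otherwise 'fail'."""
    return "pass" if raw_score >= cut_score else "fail"

if __name__ == "__main__":
    cut_score = 30                     # e.g., a panel-recommended cut score on a 50-item test
    raw_scores = [24, 30, 41, 29]      # hypothetical test-taker raw scores

    for score in raw_scores:
        print(f"Raw score {score}: {classify(score, cut_score)}")
```

The decision rule itself is trivial; what the standard setting workshop is concerned with is how defensibly that single number is arrived at, which is the subject of the chapters that follow.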
Recruiting panellists for a standard setting workshop places a heavy financial burden on the awarding body commissioning the cut score study. The external costs associated with conducting such a study usually entail hiring a suitable venue, offering panellists a financial incentive for participating in the study (per diem or lump sum) and, when panellists are associated with a university, paying the university a sum for contracting their lecturers. Furthermore, when an awarding body has limited human resources, it may need to hire temporary staff to help with the amount of preparation needed to conduct the workshop. For example, a large volume of photocopies needs to be made so that all panellists have their own sets of materials (i.e., training materials, the test instrument, rating forms, etc.) that will be used during the study. In cases where the awarding body cannot conduct the cut score study itself, standard setting practitioners need to be contracted for the study. There are also internal costs associated with standard setting meetings, such as internal meetings held amongst staff to organise the cut score studies, the follow-up meetings to discuss the recommended cut scores and their implications, and even the write-up of the cut score study itself. In some studies, qualified internal staff may participate as panellists in the standard setting sessions to reduce the external costs. The time that internal staff devote to these activities is time taken from their everyday duties and responsibilities, which usually results in a backlog of work.
Some standard setting practitioners (Harvey & Way, 1999; Katz, Tannenbaum, & Kannan, 2009; Schnipke & Becker, 2007) have started exploring the feasibility of setting cut scores in virtual environments to offset the external costs associated with F2F standard setting. Virtual environments here are defined as artificial environments in which geographically isolated participants engage in computer-mediated conversation with one another through e-communication tools (e.g., email, audio-conferencing, and videoconferencing). The very few empirical virtual standard setting studies published to date (Harvey & Way, 1999; Katz & Tannenbaum, 2014; Katz, Tannenbaum, & Kannan, 2009) have confirmed that it is feasible to conduct a standard setting workshop in (1) an asynchronous virtual environment – one in which panellists are not necessarily in the virtual environment at the same time – or (2) a combined synchronous and asynchronous environment, in which one or more parts of a cut score study are conducted in real time, while other parts are conducted offline. These studies have also revealed that virtual standard setting can be conducted through different e-communication media such as email, audio-conferencing and/or call conferencing, and even through a combination of audio-conferencing and videoconferencing. While such findings paint a positive picture of virtual standard setting, it remains an under-investigated area of standard setting.
The empirical virtual standard setting studies published to date have been conducted in a series of smaller sessions. However, in a F2F setting the duration of a cut score study on a language examination may range from approximately 1 to 1.5 days, when a cut score is to be set on a single instrument measuring a single skill (e.g., listening, reading, writing, or speaking), to as many as eight days when multiple cut scores need to be set on multiple instruments. The feasibility of conducting virtual sessions of comparable length has yet to be investigated. The demands placed on both the panellists’ equipment (i.e., computers, cameras, microphones, bandwidth requirements, etc.) and on the panellists themselves (e.g., fatigue, motivation, distractions, etc.) may be too great, resulting in some of the participants withdrawing from the study or the study itself not being completed.
Little is known about whether an appropriate e-communication medium for conducting a virtual standard setting study exists, and if so, how a standard setting workshop might best be conducted within that medium. None of the published virtual standard setting studies have compared two different e-communication media (i.e., audio-conferencing and videoconferencing) to explore whether they result in comparable and/or equally reliable cut scores. What is also not clear is to what degree the virtual medium can affect panellists’ decision-making processes and/or their perceptions and evaluations of the virtual environment. A related issue is how such perceptions are to be evaluated. In the literature on standard setting, specific guidance for conducting and evaluating cut score studies is provided (Cizek & Earnest, 2016; Council of Europe, 2009; Kaftandjieva, 2004; Kane, 2001; Pitoniak, 2003; Zieky, Perie, & Livingston, 2008); however, the translation of this guidance to the virtual environment requires further exploration.
1.2 Scope of the study
This study sought to address the gap that exists in the virtual standard setting literature. Its aim was threefold. The first aim was to investigate whether a particular e-communication medium (audio or video) was more appropriate than the other when replicating a F2F standard setting workshop. This aim was addressed through (1) selecting a web-conferencing platform for the study which could be used for both audio-conferencing and videoconferencing and (2) recruiting four groups of panellists to participate in two synchronous virtual sessions lasting approximately six hours (with breaks) each.
The second aim was to investigate whether the cut scores set via the two e-communication media (audio and video) were reliable and comparable, and as such would allow valid inferences to be drawn for cut score interpretations, and whether the virtual cut scores were comparable with previously set F2F cut scores. This aim was addressed through employing an embedded mixed methods, counterbalanced research design. To explore the comparability of the virtual cut scores between and across panels and media, two similar test instruments previously equated through a Rasch-based procedure were used. The reliability and the internal validity of the virtual cut scores were investigated by applying Kane’s framework (Kane, 2001). The virtual cut scores were also compared with cut scores previously set on the same test instruments in a F2F environment.
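For readers unfamiliar with Rasch-based equating, the formula below is the standard dichotomous Rasch model, given here only as general background; the study’s own equating and anchoring procedures are described in Chapter 3 and are not reproduced in this sketch.

```latex
% Standard dichotomous Rasch model (general background, not the study's exact specification).
% P(X_{ni}=1) is the probability that person n answers item i correctly;
% \theta_n is person ability and b_i item difficulty, both expressed in logits.
\[
  P(X_{ni} = 1 \mid \theta_n, b_i) \;=\; \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
\]
```

Because person abilities and item difficulties are expressed on the same logit scale, two test forms sharing common (anchor) items can be placed on a common metric, which is what makes cut scores set on different forms comparable.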
The third aim was to explore whether either of the e-communication media (audio and video) affected the panellists’ decision-making processes as well as their perceptions and evaluations of how well they communicated in each medium. This aim was investigated quantitatively through an analysis of survey data and qualitatively through an analysis of open-ended survey questions and focus group transcripts. The quantitative and qualitative findings were integrated and discussed with reference to media naturalness theory (MNT) (Kock, 2004, 2005, 2010) to gain new insights into virtual standard setting.
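As an illustration of the kind of quantitative comparison involved, the sketch below applies a Wilcoxon signed-rank test to paired Likert-type ratings given by the same (hypothetical) panellists under the audio and video media. The ratings are invented, and the snippet loosely mirrors the Wilcoxon signed-rank/sign tests listed among the book’s tables rather than reproducing the author’s exact analysis.

```python
# Sketch of a paired comparison of survey ratings between the two media.
# The ratings are invented; the analysis loosely mirrors the Wilcoxon
# signed-rank tests reported in Chapter 5, not the author's exact procedure.
from scipy.stats import wilcoxon

# Hypothetical 5-point Likert ratings from the same ten panellists in each medium
audio_ratings = [4, 3, 5, 2, 4, 3, 2, 5, 3, 4]
video_ratings = [5, 4, 4, 4, 5, 4, 5, 4, 4, 3]

# Paired, non-parametric test of whether ratings differ systematically between media
stat, p_value = wilcoxon(audio_ratings, video_ratings)
print(f"Wilcoxon signed-rank statistic: {stat}, p-value: {p_value:.3f}")
```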
The study sought to contribute to the limited research in virtual standard setting in three ways: (1) theoretical; (2) practical; and (3) methodological. The first contribution of the study was (i) to provide evidence of the theoretical feasibility of conducting a synchronous virtual standard setting study simulating F2F conditions, and (ii) to test a theoretical framework for evaluating qualitative data collected from virtual standard setting panellists by drawing on the principles of MNT. The second contribution was a practical framework for conducting virtual standard setting, offering guidance to standard setting practitioners. The final contribution was a methodological framework for analysing multiple-panel cut scores through equating and anchoring test instruments to their respective difficulty levels. The study also added to the scarce literature on evaluating cut score data through MFRM (Eckes, 2009; Eckes, 2011/2015; Hsieh, 2013; Kaliski et al., 2012).
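As background to the MFRM analyses referred to above, a generic three-facet dichotomous formulation for Yes/No Angoff judgements might be written as follows; the judge, item, and medium facets shown here are illustrative, and the study’s actual facet structure is defined in its Facets specification file (Appendix G).

```latex
% Generic three-facet dichotomous MFRM formulation (illustrative only).
% P_{jim} is the probability that judge j gives a "Yes" judgement on item i under medium m;
% \lambda_j is judge leniency, \delta_i item difficulty, and \mu_m the effect of medium m.
\[
  \log\!\left(\frac{P_{jim}}{1 - P_{jim}}\right) \;=\; \lambda_j \;-\; \delta_i \;-\; \mu_m
\]
```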
1.3 Outline of the chapters
This study is presented in eight chapters. Chapter 1 provides the introduction, while Chapter 2 provides a review of the literature, with a particular focus on conducting standard setting in virtual environments. First, standard setting is defined in relation to norm-referenced and criterion-referenced test score interpretations and then defined, for the purpose of this study, as a decision-making activity. Second, the importance of standard setting is described, key elements of its evaluation are discussed, and examples of standard setting methods are presented. Third, the role of standard setting in the field of language testing and assessment (LTA) is discussed and current standard setting research is presented. Fourth, the challenges associated with conducting F2F standard setting are discussed. Next, the limited number of virtual standard setting studies reported to date are critically evaluated and the challenges associated with conducting virtual standard setting are presented. Finally, MNT is presented, and the virtual standard setting studies are re-evaluated through its principles to identify the gap in the research literature.
Details
- Pages
- 302
- Publication Year
- 2023
- ISBN (PDF)
- 9783631889046
- ISBN (ePUB)
- 9783631889053
- ISBN (Hardcover)
- 9783631805398
- DOI
- 10.3726/b20407
- Language
- English
- Publication date
- 2023 (February)
- Published
- Berlin, Bern, Bruxelles, New York, Oxford, Warszawa, Wien, 2023. 302 pp., 2 fig. col., 24 fig. b/w, 58 tables