Validating Analytic Rating Scales

A Multi-Method Approach to Scaling Descriptors for Assessing Academic Speaking

by Armin Berger (Author)
Thesis, 395 pages
Series: Language Testing and Evaluation, Volume 37

Table of Contents

  • Cover
  • Title
  • Copyright
  • About the Author
  • About the Book
  • This eBook can be cited
  • Table of Contents
  • Acknowledgements
  • List of figures
  • List of tables
  • List of abbreviations
  • 1 Introduction
  • 1.1 Background to the study
  • 1.2 Statement of the problem
  • 1.3 Purpose of the study
  • 1.4 Research questions
  • 1.5 Structure of the book
  • 2 Performance assessment of second language speaking
  • 2.1 Introduction to performance assessment
  • 2.2 The speaking construct in performance assessment
  • 2.2.1 Pre-communicative approaches
  • 2.2.2 Models of communicative competence
  • 2.2.3 Approaches to speaking
  • 2.3 Models of performance assessment
  • 2.3.1 McNamara (1996)
  • 2.3.2 Skehan (1998, 2001)
  • 2.3.3 Bachman (2002)
  • 2.3.4 Fulcher (2003)
  • 2.4 Rating scales in performance assessment
  • 3 Rating scales
  • 3.1 General characteristics
  • 3.2 Types of rating scales
  • 3.3 Theoretical and methodological concepts in rating scale development
  • 3.3.1 Intuitive approaches
  • 3.3.2 Theory-based approaches
  • 3.3.3 Empirical approaches
  • 3.3.4 Triangulation of approaches
  • 3.4 Controversy over rating scales
  • 4 Rating scale validation
  • 4.1 Validity and validity evidence
  • 4.2 Rasch-based rating scale validation
  • 4.3 Dimensionality
  • 4.4 Conclusion
  • 5 The ELTT rating scales
  • 5.1 The development process
  • 5.1.1 Intuitive phase
  • 5.1.2 Qualitative phase
  • 5.2 The ELTT construct
  • 5.2.1 Lexico-grammatical resources and fluency
  • 5.2.2 Pronunciation and vocal impact
  • 5.2.3 Structure and content
  • 5.2.4 Genre-specific presentation skills: formal presentations
  • 5.2.5 Content and relevance (interaction)
  • 5.2.6 Interaction
  • 5.3 Descriptor formulation
  • 5.4 ELTT speaking ability
  • 5.5 Conclusion
  • 6 Descriptor sorting
  • 6.1 Validating the ELTT scales
  • 6.2 Rationale
  • 6.3 Methodology
  • 6.3.1 Participants
  • 6.3.2 Instruments and procedures
  • 6.4 Analysis
  • 6.5 Results and discussion
  • 6.5.1 Inter-rater reliability
  • 6.5.2 Match between intended and empirical scale
  • 6.5.3 Descriptor analysis
  • 6.6 Preliminary conclusions
  • 6.6.1 Level allocation
  • 6.6.2 Specificity of proficiency levels
  • 6.6.3 Descriptor wording
  • 6.6.4 Recommendations for scale revision
  • 6.7 Conclusion
  • 7 Descriptor calibration
  • 7.1 Rationale
  • 7.2 Analysis
  • 7.2.1 Rasch measurement
  • 7.2.2 Specification of a measurement model and FACETS output
  • 7.2.3 Measurement quality control
  • 7.2.4 Descriptor analysis
  • 7.3 Results and discussion
  • 7.3.1 Measurement quality control
  • 7.3.2 Dimensionality of descriptors
  • 7.3.3 The proficiency continuum
  • 7.3.4 Cut-off points and content integrity
  • 7.4 Conclusion
  • 8 Descriptor-performance matching
  • 8.1 Rationale
  • 8.2 Methodology
  • 8.2.1 Participants
  • 8.2.2 Instruments and procedures
  • 8.2.3 Data collection
  • 8.3 Analysis
  • 8.3.1 Specification of a measurement model
  • 8.3.2 Measurement quality control
  • 8.4 Results and discussion
  • 8.4.1 Measurement quality control
  • 8.4.2 Dimensionality of descriptors
  • 8.4.3 The proficiency continuum
  • 8.4.4 Cut-off points and content integrity
  • 8.5 Conclusion
  • 8.6 Comparison of methods
  • 9 Revision of the ELTT scales
  • 9.1 Establishing a quality hierarchy of descriptor units
  • 9.2 The quality of descriptor units
  • 9.3 Constructing the revised scales
  • 9.4 Common points of reference
  • 9.5 The modified versions of the ELTT scales
  • 10 Conclusion
  • 10.1 Summary
  • 10.2 Theoretical implications
  • 10.3 Practical recommendations
  • 10.4 Limitations of the study
  • 10.5 Suggestions for further research
  • 10.6 Concluding statement
  • 11 References
  • 12 Appendix
  • 12.1 Appendix 1: Original ELTT rating scales
  • 12.2 Appendix 2: Sorting task questionnaire
  • 12.3 Appendix 3: Consensual scales based on descriptor sorting
  • 12.4 Appendix 4: Descriptor unit measurement report (descriptor calibration)
  • 12.5 Appendix 5: All facet vertical ruler (sorting task)
  • 12.6 Appendix 6: Speaking tasks
  • 12.7 Appendix 7: Rating sheets
  • 12.8 Appendix 8: Rater guidelines
  • 12.9 Appendix 9: Student measurement report (descriptor-performance matching)
  • 12.10 Appendix 10: All facets vertical ruler (descriptor-performance matching)
  • 12.11 Appendix 11: Descriptor unit measurement report (descriptor-performance matching)

Acknowledgements

I would like to express my sincere gratitude to all those – far too numerous to mention here – who supported me during my academic journey. In particular, I wish to thank Christiane Dalton-Puffer, Günther Sigott, Tim McNamara, Charles Alderson, Ari Huhta, Rita Green, and Hermann Cesnik for the opportunity to discuss my work with them. Their insightful, instructive, and genuinely useful feedback helped me shape this research. Responsibility for any errors or inadequacies that remain is, of course, entirely my own.

Thank you for sharing your great expertise!

Furthermore, I would like to express my gratitude to the members of the ELTT group who developed the two analytic rating scales I was fortunate enough to investigate: Martina Elicker, Helen Heaney, Martin Kaltenbacher, Gunther Kaltenböck, Thomas Martinek, and Benjamin Wright. Working with them has been an enjoyable and educational experience.

Thank you for your commitment to professionalism!

I am deeply indebted to my colleagues who participated as raters in the project: Nancy Campbell, Lucy Cripps, Dianne Davies, Grit Frommann, Meta Gartner-Schwarz, Anthony Hall, Helen Heaney, Claire Jones, Katharina Jurovsky, Gunther Kaltenböck, Christina Laurer, Sandra Pelzmann, Michael Phillips, Horst Prillinger, Karin Richter, Angelika Rieder-Bünemann, Jennifer Schumm Fauster, Gillian Schwarz-Peaker, Nicholas Scott, Susanne Sweeney-Novak, Andreas Weissenbäck, and Sarah Zehentner. I greatly appreciate their willingness to share their expertise and devote time – often enormous amounts – to the project for nothing but sincere gratitude in return.

Thank you for your academic idealism!

I would also like to thank all our students who generously consented to take part in the study. The spectacle of a mock exam and the dubious privilege of counting themselves participants in a research study were poor rewards for their genuine motivation and great service.

Thank you for your academic curiosity!

On a personal note, I am extremely fortunate to have had the wholehearted love and support of my family and friends. It was their patience and understanding that enabled me to juggle a full-time teaching job, a research project, and many other professional activities. Words cannot describe the gratitude I feel towards my wife, Angela, who is the greatest source of inspiration in my life, bar none.

Sorry for not always having my priorities right!

List of figures

Figure 1:    Components of language competence (Bachman 1990: 87)

Figure 2:    Components of language competence (Bachman & Palmer 1996: 63)

Figure 3:    Levelt’s blueprint for the speaker (Levelt 1989: 9)

Figure 4:    A summary of oral skills (Bygate 1987: 50)

Figure 5:    Variables influencing performance in a speaking test (McNamara 1996: 86)

Figure 6:    Skehan’s (1998: 172) model of oral test performance

Figure 7:    Bachman’s (2002: 467) expanded model of oral test performance

Figure 8:    Fulcher’s (2003: 115) expanded model of speaking test performance

Figure 9:    A framework for describing approaches to rating scale development

Figure 10:  Messick’s (1989: 20) facets of validity

Figure 11:  Facets of rating scale validity (Knoch 2009: 65)

Figure 12:  The ELTT scale development process

Figure 13:  The ELTT model of speaking ability

Figure 14:  Scale category probability curves (descriptor sorting)

Figure 15:  Task specifications

Figure 16:  Scale category probability curves (descriptor-performance matching)

Figure 17:  Classification instrument for assessing descriptor unit quality

Figure 18:  Common reference points and descriptor keywords

Figure 19:   An expanded model of performance assessment, based on Fulcher (2003) and Knoch (2009)

Figure 20:  An expanded model for rating scale development

List of tables

Table 1:    Inter-rater reliability statistics

Table 2:    Discriminant analysis: classification results

Table 3:    Discriminant analysis: classification results according to scale criteria

Table 4:    Unilevel descriptor units with agreement figures of < 60 % in the sorting task

Table 5:    Multi-level descriptor units with agreement figures of > 60 % in the sorting task

Table 6:    Rater measurement report (descriptor sorting)

Table 7:    Criterion measurement report (descriptor sorting)

Table 8:    Category statistics (descriptor sorting)

Table 9:    Misfitting LGF descriptor units (descriptor calibration)

Table 10:  Unexpected calibrations within lexico-grammatical resources and fluency (descriptor calibration)

Table 11:  Unexpected calibrations within pronunciation and vocal impact (descriptor calibration)

Table 12:  Unexpected calibrations within structure and content (descriptor calibration)

Table 13:  Unexpected calibrations within content and relevance (descriptor calibration)

Table 14:  Synopsis of calibrated descriptor components: LGF (descriptor calibration)

Table 15:  Synopsis of calibrated descriptor components: PVI (descriptor calibration)

Table 16:  Synopsis of calibrated descriptor components: PSCW (descriptor calibration)

Table 17:  Synopsis of calibrated descriptor components: PGSP (descriptor calibration)

Table 18:  Synopsis of calibrated descriptor components: ICRW (descriptor calibration)

Table 19:  Synopsis of calibrated descriptor components: IINH (descriptor calibration)

Table 20:  Number of videotaped speaking performances

Table 21:  Rater measurement report (descriptor-performance matching)

Table 22:  Criterion measurement report (descriptor-performance matching)

Table 23:  Category statistics (descriptor-performance matching)

Table 24:  Misfitting LGF descriptor units (descriptor-performance matching)

Table 25:  Synopsis of calibrated descriptor components: LGF (descriptor-performance matching)

Table 26:  Synopsis of calibrated descriptor components: PVI (descriptor-performance matching)

Table 27:  Synopsis of calibrated descriptor components: PSCW (descriptor-performance matching)

Table 28:  Synopsis of calibrated descriptor components: PGSP (descriptor-performance matching)

Table 29:  Synopsis of calibrated descriptor components: ICRW (descriptor-performance matching)

Table 30:  Synopsis of calibrated descriptor components: IINH (descriptor-performance matching)

Table 31:  Consistency and consensus indices of measures and band allocations

Table 32:  Illustrative quality classifications

Table 33:  Distribution of descriptor unit quality

Table 34:  ELTT descriptor units of excellent quality

Table 35:  The ELTT presentation scale after reintegrating the most stable descriptor units

Table 36:  The ELTT interaction scale after reintegrating the most stable descriptor units

Table 37:  Descriptor units added for adequate construct representation

Table 38:  Presentation scale

Table 39:  Interaction scale

List of abbreviations

Summary

This book presents a unique inter-university scale development project, focusing on the validation of two new analytic rating scales for the assessment of academic presentations and interactions. The use of rating scales for performance assessment has increased considerably in educational contexts, yet empirical research investigating the effectiveness of such scales remains scarce. The author reports on a multi-method study designed to scale the level descriptors on the basis of both expert judgments and performance data. The salient characteristics of the scale levels offer a specification of academic speaking ability, adding concrete detail to the reference levels of the Common European Framework of Reference for Languages. The findings suggest that validation procedures should be mapped onto theoretical models of performance assessment.
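
As the chapter outline indicates (Chapters 7 and 8), the descriptor calibration relies on many-facet Rasch measurement with the FACETS program. For orientation only, here is a minimal sketch of the standard many-facet Rasch model on which such analyses are based; the study's exact facet specification is given in the book itself and may differ:

$$\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k$$

where $P_{nijk}$ is the probability of person $n$ receiving category $k$ rather than $k-1$ from rater $j$ on criterion $i$, $B_n$ is the person's ability, $D_i$ the difficulty of the criterion or descriptor, $C_j$ the severity of the rater, and $F_k$ the threshold of category $k$ relative to category $k-1$. Calibrating descriptor units on this common logit scale is what allows the empirical ordering of descriptors to be compared with their intended scale levels.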

Details

Pages
395
ISBN (PDF)
9783653061833
ISBN (ePUB)
9783653960426
ISBN (MOBI)
9783653960419
ISBN (Book)
9783631666913
Language
English
Publication date
2015 (December)
Published
Frankfurt am Main, Berlin, Bern, Bruxelles, New York, Oxford, Wien, 2015. 395 pp., 39 tables

Biographical notes

Armin Berger (Author)

Armin Berger is a Senior Lecturer in English as a Foreign Language in the English Department at the University of Vienna. His main research interests include teaching and assessing speaking, rater behaviour, language assessment literacy, and foreign language teacher education.
