Validating Analytic Rating Scales

A Multi-Method Approach to Scaling Descriptors for Assessing Academic Speaking

by Armin Berger (Author)
Thesis 395 Pages


I would like to express my sincere gratitude to all those – far too numerous to mention here – who supported me during my academic journey. In particular, I wish to thank Christiane Dalton-Puffer, Günther Sigott, Tim McNamara, Charles Alderson, Ari Huhta, Rita Green, and Hermann Cesnik for the opportunity to discuss my work with them. Their insightful, instructive, and wholly useful feedback helped me shape this research. The responsibility for any errors or inadequacies that may occur in this work, of course, is entirely my own.

Thank you for sharing your great expertise!

Furthermore, I would like to express my gratitude to the members of the ELTT group who developed the two analytic rating scales I was fortunate enough to investigate: Martina Elicker, Helen Heaney, Martin Kaltenbacher, Gunther Kaltenböck, Thomas Martinek, and Benjamin Wright. Working with them has been an enjoyable and educational experience.

Thank you for your commitment to professionalism!

I am deeply indebted to my colleagues who participated as raters in the project: Nancy Campbell, Lucy Cripps, Dianne Davies, Grit Frommann, Meta Gartner-Schwarz, Anthony Hall, Helen Heaney, Claire Jones, Katharina Jurovsky, Gunther Kaltenböck, Christina Laurer, Sandra Pelzmann, Michael Phillips, Horst Prillinger, Karin Richter, Angelika Rieder-Bünemann, Jennifer Schumm Fauster, Gillian Schwarz-Peaker, Nicholas Scott, Susanne Sweeney-Novak, Andreas Weissenbäck, and Sarah Zehentner. I greatly appreciate their willingness to share their expertise and devote time – often enormous amounts – to the project for nothing but sincere gratitude in return.

Thank you for your academic idealism!

I would also like to thank all our students who generously consented to take part in the study. The spectacle of a mock exam and the doubtful privilege of being able to consider themselves participants in a study was a poor reward for real motivation and great service.

Thank you for your academic curiosity!

On a personal note, I am extremely fortunate to have had the wholehearted love and support of my family and friends. It was their patience and understanding that helped me manage to juggle a full-time teaching job, a research project, and many other professional activities. Words cannot describe the gratitude I feel towards my wife, Angela, who is the greatest source of inspiration in my life, bar none.

Sorry for not always having my priorities right!

List of figures

Figure 1:    Components of language competence (Bachman 1990: 87)

Figure 2:    Components of language competence (Bachman & Palmer 1996: 63)

Figure 3:    Levelt’s blueprint for the speaker (Levelt 1989: 9)

Figure 4:    A summary of oral skills (Bygate 1987: 50)

Figure 5:    Variables influencing performance in a speaking test (McNamara 1996: 86)

Figure 6:    Skehan’s (1998: 172) model of oral test performance

Figure 7:    Bachman’s (2002: 467) expanded model of oral test performance

Figure 8:    Fulcher’s (2003: 115) expanded model of speaking test performance

Figure 9:    A framework for describing approaches to rating scale development

Figure 10:  Messick’s (1989: 20) facets of validity

Figure 11:  Facets of rating scale validity (Knoch 2009: 65)

Figure 12:  The ELTT scale development process

Figure 13:  The ELTT model of speaking ability

Figure 14:  Scale category probability curves (descriptor sorting)

Figure 15:  Task specifications

Figure 16:  Scale category probability curves (descriptor-performance matching)

Figure 17:  Classification instrument for assessing descriptor unit quality

Figure 18:  Common reference points and descriptor keywords

Figure 19:   An expanded model of performance assessment, based on Fulcher (2003) and Knoch (2009)

Figure 20:  An expanded model for rating scale development

List of tables

Table 1:    Inter-rater reliability statistics

Table 2:    Discriminant analysis: classification results

Table 3:    Discriminant analysis: classification results according to scale criteria

Table 4:    Unilevel descriptor units with agreement figures of < 60 % in the sorting task

Table 5:    Multi-level descriptor units with agreement figures of > 60 % in the sorting task

Table 6:    Rater measurement report (descriptor sorting)

Table 7:    Criterion measurement report (descriptor sorting)

Table 8:    Category statistics (descriptor sorting)

Table 9:    Misfitting LGF descriptor units (descriptor calibration)

Table 10:  Unexpected calibrations within lexico-grammatical resources and fluency (descriptor calibration)

Table 11:  Unexpected calibrations within pronunciation and vocal impact (descriptor calibration)

Table 12:  Unexpected calibrations within structure and content (descriptor calibration)

Table 13:  Unexpected calibrations within content and relevance (descriptor calibration)

Table 14:  Synopsis of calibrated descriptor components: LGF (descriptor calibration)

Table 15:  Synopsis of calibrated descriptor components: PVI (descriptor calibration)

Table 16:  Synopsis of calibrated descriptor components: PSCW (descriptor calibration)

Table 17:  Synopsis of calibrated descriptor components: PGSP (descriptor calibration)

Table 18:  Synopsis of calibrated descriptor components: ICRW (descriptor calibration)

Table 19:  Synopsis of calibrated descriptor components: IINH (descriptor calibration)

Table 20:  Number of videotaped speaking performances

Table 21:  Rater measurement report (descriptor-performance matching)

Table 22:  Criterion measurement report (descriptor-performance matching)

Table 23:  Category statistics (descriptor-performance matching)

Table 24:  Misfitting LGF descriptor units (descriptor-performance matching)

Table 25:  Synopsis of calibrated descriptor components: LGF (descriptor-performance matching)

Table 26:  Synopsis of calibrated descriptor components: PVI (descriptor-performance matching)

Table 27:  Synopsis of calibrated descriptor components: PSCW (descriptor-performance matching)

Table 28:  Synopsis of calibrated descriptor components: PGSP (descriptor-performance matching)

Table 29:  Synopsis of calibrated descriptor components: ICRW (descriptor-performance matching)

Table 30:  Synopsis of calibrated descriptor components: IINH (descriptor-performance matching)

Table 31:  Consistency and consensus indices of measures and band allocations

Table 32:  Illustrative quality classifications

Table 33:  Distribution of descriptor unit quality

Table 34:  ELTT descriptor units of excellent quality

Table 35:  The ELTT presentation scale after reintegrating the most stable descriptor units

Table 36:  The ELTT interaction scale after reintegrating the most stable descriptor units

Table 37:  Descriptor units added for adequate construct representation

Table 38:  Presentation scale

Table 39:

List of abbreviations

1  Introduction

Although rating scales as operationalisations of speaking test constructs are extensively used in oral performance assessment, they often lack empirical validation. One of the main concerns is that rating scales which were developed intuitively based on expert judgements fail to represent the progression of speaking proficiency in an appropriate way (Brindley 1998; Hulstijn 2007; Kramsch 1986; Lantolf & Frawley 1985; Savignon 1985). It is far from clear whether, and if so to what extent, rating scales in fact describe an implicational continuum of increasing language proficiency that corresponds to real language use. Accordingly, there is a need for extensive research in the field of educational language testing to show that the speaking constructs and their operationalisations in rating scales are related to the reality of language use (Kaftandjieva & Takala 2003). The study reported in this book sought to do exactly that. It investigated the operationalisation of a speaking construct in two analytic rating scales developed by the Austrian English Language Teaching and Testing (ELTT) initiative, aiming to ascertain whether the level descriptors actually represent an incremental pattern of increasing speaking proficiency. The introductory chapter of this book first describes the background to the research. Then it goes on to state the problem and purpose of the study. Finally, it lists the major research questions and outlines the structure of the book.

1.1  Background to the study

The research presented here grew out of a unique inter-university construct definition project initiated by the Language Testing Centre (LTC) at Klagenfurt University in cooperation with the Department of English at the University of Vienna back in 2008. At that time, speaking classes at Austrian university English departments differed widely in terms of both the number required in the compulsory course programmes and their specific nature and orientation. Not surprisingly then, the language testing practices were as varied as the courses themselves, with most testing procedures generally being restricted to measuring achievement of specific course-related objectives. The BA programme at the Department of English at Salzburg University, for example, offered a number of speaking-related classes, including Pronunciation and Intonation, Listening and Speaking, Communication and Culture, Presentation Skills, and Discussion and Debate. Formal assessment in most of these courses was based on course participation and ongoing assessment but did not involve a final speaking exam. The University of Graz, by comparison, offered courses entitled Language Production Skills, Advanced Language Production Skills, Professional Presentation Skills, Pronunciation, and Advanced Pronunciation. While assessment in the latter two was based on a final oral exam, all other courses assessed speaking in an integrated way together with other language skills. The BA programme at the University of Klagenfurt included Pronunciation, Speaking I: Presentations, and Speaking II: Professional and Social Interaction, all of which required an oral exam at the end of the semester. Innsbruck offered four consecutive courses (Listening/Speaking I-IV), with a focus on interaction and fluency, expressing viewpoints, expressing ideas and opinions in social and professional contexts, and conversation and discussion, respectively. The Department of English at the University of Vienna, finally, offered two speaking-related classes. While Practical Phonetics and Oral Communication Skills 1 (PPOCS 1) was intended to teach the main aspects of English pronunciation at both the segmental and suprasegmental levels, Practical Phonetics and Oral Communication Skills 2 (PPOCS 2) focused on interactive speaking skills and formal presentations. Assessment in both courses was mainly based on a final oral exam.

In view of this variety, it can be said that although speaking ability was being assessed in Austria at tertiary level, standardisation had been largely missing. Language testing in the English Studies programmes at Austrian universities had been a fairly independent and isolated undertaking of individual teachers. Although standardised language programmes with common course curricula had existed at each department, standardisation of testing practices was rare; lecturers generally designed their own instruments to test students’ achievement of specific course-related objectives. Test content, format, and assessment criteria varied from teacher to teacher, who had to rely almost exclusively on their own testing experience. There may have been some common guidelines regarding examination procedures, such as a double marking policy, but specifications for these examinations rarely existed.

A notable exception was and still is the oral exam at the end of the PPOCS 2 course at the English Department in Vienna. The construct, content, test methods, and scoring procedures are standardised and explicated in the test specifications, which state that candidates are being tested on their ability to deliver a clear and effective presentation and their ability to interact with several interlocutors to jointly construct conversational discourse. As for the task format, students are required to give a five-minute oral presentation individually and then engage in a 15-minute spoken interaction with three other students. While the former is a condensed and improved version of the in-class presentation, the latter takes the form of unrehearsed role play with role cards, involving a discussion of a controversial topic with clear goals for each participant. Two raters award separate scores for presentations and interactions independently of each other by means of analytic rating scales. The final grade is an overall average of each analytical category score awarded by the two raters.
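The averaging step described above can be illustrated with a minimal sketch. It assumes a simple unweighted mean over all category scores from both raters; the test specifications may define weighting or rounding differently. The category names and score values are hypothetical placeholders, not the actual scale criteria or data.

```python
from statistics import mean


def final_grade(rater_a, rater_b):
    """Return the unweighted mean of all category scores from both raters.

    Assumption: every analytic category counts equally and no rounding
    is applied; the source does not specify either detail.
    """
    if rater_a.keys() != rater_b.keys():
        raise ValueError("both raters must score the same categories")
    all_scores = list(rater_a.values()) + list(rater_b.values())
    return mean(all_scores)


# Hypothetical scores for four placeholder categories
rater_a = {"category_1": 4, "category_2": 3, "category_3": 5, "category_4": 4}
rater_b = {"category_1": 3, "category_2": 3, "category_3": 4, "category_4": 4}

print(final_grade(rater_a, rater_b))  # 30 / 8 = 3.75
```

Pooling all scores before averaging gives the same result as averaging the two raters' means whenever both raters score the same set of categories, which the check at the top of the function enforces.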

Most programmes, however, tended to focus on testing achievement of course-specific objectives, while at the same time neglecting proficiency-based certification. None of the Austrian English departments required an oral exit-level proficiency exam at the end of the BA programme irrespective of a particular syllabus, with the exception of Klagenfurt, where the final exam (Fachprüfung) involved a one-minute presentation of a topic followed by an interaction with other students on this topic. Vienna used to administer a spoken component (viva) in the Exit Level Language Assessment (ELLA), which had been developed with a view to certifying a minimum level of speaking proficiency corresponding to C1 in the Common European Framework of Reference for Languages (CEFR) (CoE 2001). However, this oral component of ELLA has recently been dropped owing to resource constraints.

Much as the testing practices have lacked standardisation and validity in many respects, there has been “an unprecedented move toward professionalization in language assessment” in Austria recently (Dalton-Puffer et al. 2011: 200). With a general movement towards more transparency in the educational system and the demand for international comparability as to language proficiency, the testing scene in Austria has begun to change in recent years. Most notably, national educational standards have been developed and implemented at secondary level, a process which “is also leaving its mark at tertiary level” (Dalton-Puffer et al. 2011: 202). Major structural changes have led to a growing demand for standardisation of language testing in university contexts. As most curricula at Austrian universities have been converted into separate bachelor and master programmes, such restructuring of the system has also raised the question as to what level of proficiency a BA graduate in English language and literature should have acquired.

It was particularly the LTC at Klagenfurt University and its ELTT initiative that promoted concerted action to professionalise and standardise language assessment practices at Austrian university English departments. This working group, consisting of applied linguists and language teaching experts from the Universities of Graz, Klagenfurt, Salzburg, and Vienna, pursued the goal of more professional assessment practices for high-stakes examinations at Austrian university English language departments, focusing on the certification of language proficiency at the end of the BA programmes. Seeking to remedy some of the shortcomings in everyday assessment practice, they set out to develop a common analytic rating scale for writing and benchmarked writing performances in 2007. Such instruments were intended for university teachers and students alike. While the former were meant to use them in local markers’ meetings and rater training sessions, the latter could refer to them if they wanted to familiarise themselves with the level of performance expected at the end of their studies. With such initiatives, language testing ceased to be an isolated, solitary activity of individual teachers and started to become more of a group endeavour in which language teachers cooperated.

While there has been some more work on assessing reading (Heaney 2011) and listening skills (Schuller 2008), the nature of speaking ability at tertiary level and the question of measuring it still require attention in theory and practice. As a first step, Berger (2009) presented an in-house rating scale for the PPOCS 2 exam at the Department of English in Vienna. At about the same time, the ELTT group set out to define and operationalise a test construct for the certification of speaking proficiency at the end of the BA programmes. The outcome of this project was a set of analytic rating scales, including one for assessing academic presentations and one for assessing discussion-based interactions, both of which encompass levels C1 and C2 according to the CEFR. This set of analytic rating scales is the main subject of this book.

1.2  Statement of the problem

The approach to rating scale development taken by the ELTT group was a combination of the three main approaches: It was (a) intuitive in the sense that university language teachers, drawing on their expertise and experience, identified those aspects of speaking ability they considered crucial for the intended purpose and target level. It was (b) theory-based in that the selection of criteria was aligned with the theoretical framework presented in the CEFR. Anchor descriptors were extracted from the CEFR, and descriptor wordings were in large part modelled on the illustrative formulations at levels C1 and C2. While some ELTT descriptors closely resembled the CEFR formulations, others were unique to the ELTT scales. Finally, the approach was (c) data-informed in the sense that actual speaking performances shaped the wording of the descriptors. Student performances were used to check whether the speech phenomena deemed characteristic of a particular level were actually observable in real speech representing the potential range of test-taker levels. In addition, observing and analysing real performances helped the scale developers to revise and modify the draft descriptors in a meaningful way.

Creating such rating scales, however, does not suffice; they also need to be subjected to a process of validation. That is, the assessment instruments themselves are scrutinised to uncover their potential weaknesses. One of the main concerns is that the hierarchies of the skills and abilities implied in the rating scales lack validity. It is far from clear whether or not the rating scales in fact describe an implicational continuum, i.e. an incremental pattern of increasing language proficiency. Although the ELTT group used a methodologically triangulated approach to scale construction in order to increase the validity and reliability of the procedure, the scales cannot be assumed to form implicational scales a priori. Accordingly, there is a need for research to show that the progression described in the rating scales corresponds to the reality of language use. In other words, the hierarchy of the abilities characterising the scale levels may not be assumed but must be established through investigation.

1.3  Purpose of the study

The main objective of this study was to investigate the descriptors that make up the ELTT rating scales (presentation and interaction) and ascertain whether they actually represent an incremental pattern of increasing speaking proficiency. To this end, a multi-method approach to the validation of the scales was designed. In phase one, which was termed descriptor sorting and conducted as part of a preliminary study (Berger 2012), expert teachers assigned each descriptor unit to a particular level of speaking proficiency. The purpose of this phase was to find out whether experienced university teachers were able to reconstruct the intended hierarchy of the level descriptors, and thus intuitively establish an underlying pattern of increasing speaking proficiency, thereby providing qualitative validity evidence. In phase two, which is referred to as descriptor calibration, the sorting task data was subjected to multi-faceted Rasch analysis so as to obtain a calibrated scale of descriptor units. In phase three, finally, the descriptor units were related to samples of real speech, hence the label descriptor-performance matching, again with a view to obtaining a calibrated scale of descriptor units. Eight experienced university teachers observed a total of 153 videotaped speaking performances to assess how well each scale descriptor represented a particular performance. These ratings were then Rasch-analysed to obtain difficulty estimates for each descriptor unit. The ultimate practical aim of the study was to identify and eradicate any flawed scale components and to suggest revised versions of the ELTT scales.
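For orientation, a many-facet Rasch model of the kind commonly applied to rating data such as those collected in phase three can be written in its generic rating-scale form as shown below. This is a sketch under stated assumptions, not the specification used in the study: the choice of facets (performance, descriptor unit, rater) and all symbols are illustrative and are not taken from the source.

```latex
% Generic many-facet Rasch model (rating-scale form).
% Assumed, illustrative notation:
%   P_{nijk}     probability that rater j assigns category k
%                to performance n on descriptor unit i
%   P_{nij(k-1)} probability of the adjacent lower category k-1
%   B_n          measure of performance n
%   D_i          difficulty of descriptor unit i
%   C_j          severity of rater j
%   F_k          threshold between categories k-1 and k
\[
  \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]
```

Under a model of this form, all facets are placed on a common logit scale, so the estimated D_i values can be compared directly and ordered into an empirical hierarchy of descriptor units.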

At a more general level, the purpose of this study was to make a significant step towards specifying advanced speaking proficiency. The results may help to flesh out and add substance to the scarcely defined proficiency descriptions of the CEFR at levels C1 and C2. The psychometrically most stable ELTT descriptors are considered to have the potential to complement the CEFR descriptors and provide finer, meaningful, and concrete distinctions within the higher levels. Finally, the study aimed to assess the impact of the methodology on the hierarchy of descriptor units and draw conclusions about the usefulness of the scale development and validation methodology for future projects. In other words, the study sought to compare the three validation procedures employed here and determine whether they can be used interchangeably in similar (small-scale) projects.

1.4  Research questions

The following research questions are addressed in this study:

1.  To what extent do the descriptors of the ELTT speaking scales (presentation and interaction) define a continuum of increasing speaking proficiency?

     a.  To what extent does the ELTT speaking construct represent a single (psychometric) dimension?

     b.  What does the empirical hierarchy of the ELTT scale descriptors look like?

     c.  Are there any natural gaps and groupings on the vertical scale of descriptors that would suggest cut-off points to allow equidistant bands?

2.  Does the validation methodology have an effect on the hierarchy of descriptor units?

3.  Which rating scale descriptors are the most effective?

The central hypotheses underlying the research were that (1) while most ELTT rating scale descriptors were expected to form an appropriate continuum of increasing speaking proficiency, some would prove dysfunctional, and (2) while the empirical order of most ELTT descriptor units was expected to correspond to the intended order as conceived of by the ELTT group, some would turn up at different levels.

The present study can thus be considered a construct validation study that is methodologically related to the scaling approach most notably associated with the work of North (1995, 2000, 2002) and North and Schneider (1998) in the context of developing a common European framework for reporting language competency, which resulted in the illustrative scales of the CEFR. The present project is conceptually different, however, in the significant respect that it investigated a speaking construct operationalised by a group of experts in a customised scale development project aiming to meet the specific needs of tertiary speaking assessment. Unlike North, who drew on a pool of descriptors from a number of existing rating scales in order to develop proficiency descriptions spanning the entire range from the lowest level of generative language use to complete mastery, the present study aimed to validate a pre-defined speaking construct covering the most advanced levels, C1 and C2, only.

1.5  Structure of the book

This book is organised into ten chapters. The introductory chapter has just provided a general overview of the project; the context and purpose of the study have been adumbrated and the major research questions presented. The following three chapters provide a review of the relevant literature. Chapter two, which introduces the notion of performance assessment of second language speaking, begins by discussing different ways of conceptualising the speaking construct and theoretical models of performance assessment, and then looks at the question of operationalising the speaking construct in rating scales. Chapter three goes on to describe general characteristics and types of rating scales. It addresses theoretical and methodological issues in rating scale development and outlines the controversy over rating scales, particularly from the perspective of second language acquisition (SLA). Chapter four reviews the relevant literature on rating scale validation, especially within the realm of Rasch measurement, and concludes by highlighting the need for extensive validation of performance assessment instruments. Chapter five describes in detail the development of the ELTT rating scales and reflects on the underlying speaking construct. The following chapters present the empirical part of the study. Chapters six, seven, and eight deal with the descriptor sorting, the descriptor calibration, and the descriptor-performance matching procedures, respectively. Each of these chapters describes the specifics of the procedure, discusses the key findings, and presents some preliminary conclusions. Chapter nine draws upon the previous chapters and synthesises the findings from all three procedures with a view to revising the original versions of the ELTT scales. Finally, the concluding chapter gives a brief summary and critique of the key findings. It discusses theoretical implications, offers practical recommendations, and identifies areas for further research. 

2  Performance assessment of second language speaking

The following three chapters provide the theoretical context of the study, including the relevant literature on performance assessment in general and rating scale development and validation in particular. Chapter two outlines performance assessment of second language speaking with a particular focus on rating scales as essential components in this process. After an introductory section that begins with some terminological clarification and gives a brief overview of the historical development of language performance assessment, chapter two examines how speaking a second language under performance conditions has been conceptualised by eminent writers in the field. It provides a chronological and comparative account of the different ways of describing speaking ability in performance assessment, including Carroll’s (1961) integrative approach, which is based on Lado’s (1961) skills and elements model, and models in the tradition of communicative language competence, most notably Canale and Swain’s (1980) model and Bachman and Palmer’s (1996) more comprehensive account of communicative language ability. The focus of this part is on the different ways in which performance has been conceptualised.

After discussing different approaches to describing the speaking construct, the literature review proceeds to models of performance assessment, which have been suggested in recent years to account for variables in addition to speaking ability that may shape performance. The approach is comparative and critical again, drawing attention to the conceptualisations of rating scales and their effects on performance assessment. It will be argued that the scoring instruments constitute an essential component in the assessment process, whose capacity to represent a performance in a direct and unequivocal way is questionable if not downright spurious. Instead, language performance is being filtered by rating scales, which may at times misconstrue the interpretations of the scores. Therefore, a critical investigation of the rating instruments in operation, both theoretical and empirical, is called for.

Chapter three focuses on rating scales as operationalisations of speaking constructs, providing a detailed overview of the general characteristics of rating scales, a typology, and theoretical and methodological issues in rating scale development. It also takes up a critical stance on the concept of proficiency scales, delineating the pertinent controversy from the perspective of second language acquisition (SLA).

In the final chapter of the literature review, the focus shifts to rating scale validation. It introduces seminal work in the context of Rasch measurement, reviews the more recent publications, and identifies the research gap that the present study seeks to close. It will be concluded that while rating scale validation, for practical and theoretical reasons, has tended to focus on the reliable application of the scales by raters in operational settings, the conceptual basis of rating scales in terms of the performance features they represent has often been neglected.

2.1  Introduction to performance assessment

Information about the speaking proficiency of second language learners is not only useful but sometimes even necessary in educational contexts. Without appropriate feedback on how well a learner can speak the target language, it may be difficult to reach pedagogically sound decisions, for example, in terms of lesson planning, directing or redirecting the teaching process according to individual needs, or rewarding achievement of learning objectives. In higher education, in particular, university language departments may want to measure achievement of course-related aims in order to award credits for the satisfactory completion of an examination module or certify speaking proficiency at the end of the study programme. Such educational measures require reliable information about how well students can speak the target language they are studying.

The ways and means to obtain the relevant information about a student’s speaking proficiency, however, have changed profoundly over the last decades. Historically speaking, there were formative influences from both within and outside education that effected considerable advancements in the field of speaking assessment. One crucial turning point came in the wake of World War II when the United States Federal Government recognised the practical need to equip its personnel taking up foreign postings with the necessary speaking skills. This political and military need prompted the authorities to establish the Foreign Service Institute (FSI), which in turn introduced its seminal test of speaking in the 1950s: the FSI Oral Proficiency Interview (OPI) – one of the first large-scale tests of functional speaking proficiency. Another strand of influence derived from a perceived need in educational contexts in the 1960s to find appropriate ways of ascertaining how well speakers can use the target language. In vocational education, workforce mobility intensified the demand for certificates of language skills, while in academia, Anglophone universities in particular were facing increasing numbers of incoming students from all over the world wishing to study abroad. As a result, components of performance assessment were introduced into the language tests.

With the emergence of the communicative approach to language teaching in the 1970s and 1980s at the latest, indirect forms of measuring speaking ability and discrete-point tests, such as traditional paper-and-pencil tests in which the candidates were asked to identify phonological features (Lado 1961), had been largely replaced by performance assessment, in which candidates were required to demonstrate that they had acquired the relevant speaking skills and competencies by using language in an act of communication. The communicative approach, which rests on the idea that languages are learnt most effectively through real and meaningful communication, together with the underlying notion of communicative competence, has formed part of the theoretical rationale behind performance assessment, which continues to be the most common form of speaking assessment to date. More detailed accounts of the history and development of testing second language speaking can be found in Spolsky (1995) and Fulcher (2003).

The different origins of and formative influences on language performance assessment have generated a conceptual distinction between strong and weak performance tests. McNamara (1996) depicts the development of second language performance assessment as the result of two traditions. The work sample tradition involves the application of performance assessment techniques from non-linguistic contexts to second language assessment. It is essentially a pragmatic approach shaped by sociolinguistic theory, in which the performance itself is the target of assessment. For instance, the observation is carried out directly in the workplace under real-life conditions, with or without controlled, standardised or simulated work tasks, and the quality of the performance of the task is of greatest interest. Although language use is involved in performing the tasks, it is merely the medium of the performance. To the extent that this approach involves real-life and non-linguistic assessment criteria, it can be considered performance assessment in a strong sense (McNamara 1996: 43ff). Performance assessment in this sense corresponds to what has more recently been termed a task-centred approach (Brown et al. 2002, Norris et al. 1998), where the focus is on whether candidates can use the language to fulfil a given task. The second tradition, in contrast, is shaped by psycholinguistic theory, considering second language performance as a complex cognitive process. Not the performance itself but the underlying linguistic knowledge and abilities are the prime target; the purpose of performance is mainly to elicit observable language. To the extent that the focus is on language performance rather than performance of the task itself, this approach is considered to reflect performance assessment in a weak sense. Although the assessment criteria may consider aspects of how well the candidates perform the task, the focus is on the quality of the language. Performance assessment in the weak sense corresponds to what Bachman (2002) termed a construct-centred approach.

It is particularly the latter sense of performance assessment that is based on and informed by theory. In fact, the momentum of performance assessment in the weak sense can be considered a direct response to the theoretical accounts of language ability emerging in the late 1960s and early 1970s. Hymes’s (1972) model of language knowledge and language performance, comprising the abilities underlying instances of actual communication, in which he reinterprets Chomsky’s (1965) renowned distinction between competence and performance, was the spur for language testers to embark on performance assessment. The underlying assumption was that language use in context reveals information about the language knowledge and ability of the candidate being assessed. Accordingly, the validation research aiming to demonstrate the construct validity of performance tests focused strongly on the theoretical rationales behind the tests and attempted to articulate the pertinent set of theoretical concepts and their interrelations, including theoretical accounts of what it means to be able to use a second language. The construct-centred approach to performance assessment and the corresponding construct validation concentrated on definitions of communicative language ability, most notably the notions put forward by Canale and Swain (1980) and Canale (1983) in response to Hymes (1972), and Bachman (1990) and Bachman and Palmer (1996). However, although some of these writers attempted to characterise the role of performance within their models of communicative competence by factoring in instances of real language use as concrete manifestations of the interaction between a speaker’s knowledge and ability for use, the theoretical basis of second language performance assessment and the relationship of performance assessment to theories of communicative ability remained rather weak (McNamara 1996: 49).

It was only the realisation that research on the validity of second language performance assessment needs to refer to a comprehensive model of the relationship between language ability, performance skills, and real communication that prompted leading thinkers in the field to develop theoretical models of performance assessment. The reference to theoretical concepts of communicative competence, let alone the practical appeal and face validity of many performance tests, proved insufficient to do justice to the complex relationship between what is to be observed, the candidate’s response to the task, and the representation of the performance in a score. Instead, theoretical foundations that can explain and predict the interaction between the participants in performance assessment, not just the ability of an individual candidate, were called for. Furthermore, an explanatory framework for performance assessment would have to identify and establish the significance of any variable in the assessment procedure other than communicative ability that potentially has an impact on the performance, including, for example, the setting, the tasks, and, most relevant to the study at hand, rating instruments. A number of models of performance assessment have been suggested since (McNamara 1996; Skehan 1998; Fulcher 2003; Knoch 2009).

Against this historical backdrop, one can see that the term performance in the context of language assessment has come to mean the requirement in a language test that the candidates be engaged in an extended act of communication, typically productive, sometimes – though less commonly – receptive, or both. This broad understanding of the notion of performance must be distinguished from other interpretations in the field (McNamara 1996: 26). To begin with, in linguistics the term is inseparably associated with Chomsky’s (1965) distinction between competence and performance, and the subsequent discussions in that tradition (Hymes 1972; Canale & Swain 1980), where the former is the intrinsic knowledge of the linguistic system as opposed to what we actually produce as utterances. Secondly, performance can refer to the outcome of a test, as in something performed, similar to some kind of artistic or dramatic presentation, usually before an audience, where the focus is on demonstrating accomplishment or skill. Thirdly, in the narrowest sense, the term has been used exclusively to denote performance of authentic tasks in work sample tests which involve the simulation of real-life contexts and activities. In this book, the term is used in the broad sense, where performance assessment is a form of communicative language assessment in which candidates are required to engage in an act of communication in response to a task similar to the ones they are likely to encounter in real life. At the same time, it should be noted that the different uses of the term performance are neither unrelated nor mutually exclusive. In fact, performance tests have been shaped by the models and theories of performance in the Chomskyan tradition, and they may – directly or indirectly – have recourse to the latter two meanings (McNamara 1996: 26).

2.2  The speaking construct in performance assessment

2.2.1  Pre-communicative approaches

One of the first explicit ‘models’ for language testing in the pre-communicative era was the structuralist ‘skills and elements’ model proposed by Lado (1961). He resolves the complexity of language by isolating individual segments, i.e. elements and skills. Elements include pronunciation, grammatical structure, the lexicon, and cultural meanings. Such elements do not appear separately in language but integrated in the skills of speaking, listening, reading, and writing. Lado points out that these elements can be studied, described, and tested separately. In connection with speaking, he suggests testing the “signalling systems of pronunciation, stress, intonation, grammatical structure, and vocabulary”. He argues that such an approach would reduce the influence of external variables such as talkativeness, introversion/extroversion, or the ability to tell interesting stories, and provide “a better coverage of the language itself and more objective scoring” (Lado 1961: 241). The basic assumption from a language testing perspective was that discrete-point items have the potential to reveal the candidate’s ability to control one particular level of language in relation to one of the four skills. The obvious advantage is that such discrete-point tests generate readily quantifiable data. Bachman (2007) points out that the skills and elements approach was perhaps the first one to draw on both linguistic theory and views of language learning and teaching. It explicitly incorporated psychometric notions of reliability and validity, thus providing a conceptual framework for defining the constructs to be tested. However, the greatest weakness of this approach is that it fails to provide measurement of communicative ability. Lado’s atomistic approach to test design rests on the spurious assumption that knowing the elements of a language is the same as knowing the language. The skills and elements model fails to explain how the knowledge of the elements is combined by the language user in various ways to meet the demands of the communicative situation. In other words, the notion of performance does not feature in this structuralist-psychometric view of language testing. Instead, language skills are reduced to knowing in isolation the various elements of the language.

While language tests in Lado’s tradition were essentially indirect, discrete-point tests, it was Carroll (1961) in response to Lado who recommended a performance component in connection with language tests at North American universities for incoming students from abroad. Although Carroll did not challenge Lado’s atomistic structuralist understanding of language, he suggested an “integrative” approach in addition to discrete-point measurement. While acknowledging Lado’s call for specific, carefully selected items testing language knowledge and skill, he also recommended an approach that requires “an integrated, facile performance on the part of the examinee” (Carroll 1961: 37), where there is less focus on specific structural elements but more on the total communicative effect of an utterance. In this view, the purpose of language tests began to feature more prominently in considerations of test design. If the test purpose requires performance, then tests should include performance on tasks that integrate aspects of language knowledge or skill. In a subsequent discussion, Carroll (1968) went a step further, arguing that performance variables should be taken into account as well. He characterised test tasks in relation to a number of dimensions, including stimulus, response, modality, complexity, and task, and explicitly addressed the relationship between the notions of task, competence, and performance.

2.2.2  Models of communicative competence

On communicative competence (Hymes 1972)

As a response to the lack of attention paid to communicative meaning in the skills and elements approach, applied linguists began to explore language ability from a much broader perspective, leading to the notion of communicative language competence. Many models of communicative competence which have been influential and productive in language performance assessment are based on Hymes’s (1972) notion of language use in social context. Criticising Chomsky’s (1965) distinction between competence and performance as too limited, Hymes suggested that there are competences which go beyond the linguistic domain, proposing that the appropriateness of language use also has to be taken into account. Rather than dealing with an idealised speaker-listener situation in an entirely homogeneous speech community, linguistic inquiry should be concerned with a heterogeneous speech community, in which differential competence and sociocultural features play an important role, and focus on the actual use of language in a concrete situation. Hymes proposed the notion of communicative competence, i.e. the knowledge required to use language in social context, and distinguishes four levels of analysis that are relevant for understanding regularities in people’s use of language. The first level refers to the question of whether and to what degree something is formally possible, i.e. the entirety of all linguistic forms and structures that are possible. The second level is concerned with what is feasible in terms of the means of implementation available. At this level, the focus of attention is on an individual’s time and processing constraints that might impact production and comprehension. Yet another level of analysis pertains to the question of whether and to what degree something is appropriate in relation to the context of use, emphasising the social and situational dimensions of different language use situations. Finally, the last level refers to what is in fact performed, i.e. what is actually done, by convention and habit. While the first level refers to everything that is possible in terms of the language code, the following levels constrict the options realised in actual language use. Of all the possible choices only a limited range is cognitively and practically feasible; of all that is feasible, only some choices are appropriate in a given situation; and of all that is appropriate, only some choices are actually performed. In other words, Hymes distinguishes between actual instances of language use, on the one hand, and underlying models of the knowledge and ability for use, on the other hand. While the ability for use concerns a person’s potential to realise a possible, feasible, and appropriate speech act, the actual realisation of the speech act itself is what he means by performance. In conclusion, then, Hymes’s notion of performance refers to actual use and actual events. According to Hymes, each of these dimensions is governed by a set of rules of use that first-language users learn. A fully competent user of a language possesses the knowledge of these rules. Much as Hymes’s theory was actually intended for the analysis of first language development, it has been applied to second and foreign language contexts as well.

Communicative competence (Canale & Swain 1980; Canale 1983)

Based on Hymes’s work, Canale and Swain (1980) first proposed a model for what they called ‘communicative competence’ for an L2 context. Probably the most significant feature of their model was their conceptualisation of language knowledge, which they considered as comprising different components in addition to grammatical competence. As for the knowledge side, their model consists of three components: grammatical, sociolinguistic, and strategic, which was later extended by Canale (1983) to include also discourse competence. While grammatical competence is understood to encompass knowledge of grammar, lexis, morphology, syntax, semantics, and phonology, sociolinguistic competence includes (a) the knowledge of the sociocultural rules of language use, which specify the ways in which language is produced and interpreted appropriately with respect to the communicative situation, and (b) rules of discourse, including cohesion and coherence. The former emphasises the appropriateness of language use, a person’s understanding of social relations, and how language use relates to them; the latter refers to the combination of utterances and communicative functions. Strategic competence comes into play when the other competences fail to generate meaning. That is, language users call on strategies to compensate for breakdowns in communication either due to insufficient grammatical or sociolinguistic competence or due to performance variables. The model proposed here is a model of knowledge, which is firmly distinguished from performance, i.e. the actual use of language in real communicative situations or the manifestation of knowledge in actual language performance. In fact, Canale and Swain (1980: 7) expressly omitted considerations of ability for use from their model, arguing that although performance may be shaped by underlying factors and skills, the possibility that any theory of human action can explain or predict ability for use satisfactorily must be questioned. Rather than modelling ability for use in their theoretical framework, they consider it as part of their notion of communicative performance, which they define as the realisation of the three components of communicative competence and their interaction when actually producing and interpreting real utterances.

While Canale and Swain (1980) affirmed that a theory of performance was not feasible as it would have to allow for all the variables unrelated to linguistic knowledge that may influence communication, Canale (1983) began to explicate such a model. He used the term ‘actual communication’ rather than ‘performance’ to include explicitly psychological and contextual variables, such as memory and perceptual constraints, fatigue, nervousness, distractions, and interfering background noises. In Canale’s (1983: 5) words, “communicative competence refers to both knowledge and skill in using this knowledge when interacting in actual communication”. Knowledge and skill are underlying capacities; actual communication is their manifestation in concrete situations. While Canale adopts the notion of grammatical competence from the 1980 model, he modifies other competences, most notably the notion of sociolinguistic competence, which now refers to sociocultural rules only, encompassing the contextual appropriateness of both meaning and form. The rules of discourse, however, have been turned into a discrete category in its own right, labelled ‘discourse competence’ as the ability to produce a “unified spoken or written text in different genres” (Canale 1983: 9). Furthermore, whereas in the 1980 model strategic competence was compensatory in nature, called on if and when the emerging language system proves deficient, Canale (1983: 11) expands this notion to include strategies that “enhance the effectiveness of communication”.

In comparison to Hymes’s model, the model proposed by Canale (1983) has a broader view of underlying ability and, therefore, assessments based on it allow broader generalisations to a wider range of contexts. Skehan (1998: 159) points out that the constructs of linguistic, sociolinguistic, discourse, and strategic competences represent a more accurate characterisation of a person’s underlying abilities and can be weighted differently to allow a more targeted focus on different language use contexts. However, the model fails to relate underlying abilities to performance and processing conditions, and it does not provide a basis for systematically establishing the language demands of a variety of different contexts. In addition, Fulcher and Davidson (2007: 42) point out that the model leaves unclear what constitutes knowledge and what a skill. It does not explicitly specify the nature of the interaction between the various components of the model with each other or with the context of language use. These shortcomings are addressed in Bachman’s (1990) model of communicative language ability.

Communicative language ability (Bachman 1990; Bachman & Palmer 1996)

Bachman (1990) further developed and expanded the Canale and Swain approach. What he called ‘communicative language ability’ was considered to have three components: language competence, strategic competence, and psychophysiological mechanisms, which refer to the knowledge of language, the capacity for using language competence in contextualised communicative situations, and the actual realisation of language as a physical phenomenon, respectively. Strategic competence, which is also affected by the knowledge structures (knowledge of the world), is redefined as a “general ability which enables an individual to make the most effective use of available abilities in carrying out a given task” (Bachman 1990: 106).

The component of language competence represents a reorganisation of previous categories based on empirical studies. Bachman distinguishes organisational competence and pragmatic competence. The former entails grammatical and textual competence, i.e. the abilities involved in controlling the formal structure of language. While grammatical competencies include the knowledge of vocabulary, morphology, syntax, and phonology/graphology, textual knowledge comprises cohesion and rhetorical organisation. Pragmatic competence, on the other hand, refers to illocutionary competence, i.e. the knowledge required to perform language functions, and sociolinguistic competence, i.e. the knowledge of the sociolinguistic conventions for performing language functions appropriately in a particular context. Following Halliday (1973), Bachman specifies the functional knowledge dimension as comprising ideational, manipulative, heuristic, and imaginative functions of language use. Finally, the sociolinguistic knowledge component involves the user’s sensitivity to dialect or variety, sensitivity to register, sensitivity to naturalness, and cultural references and figures of speech. Bachman’s components of language competence are presented in Figure 1 below.

Figure 1:    Components of language competence (Bachman 1990: 87)


What drives language use is strategic competence, i.e. “the mental capacity for implementing the components of language competence in contextualised communicative language use” (Bachman 1990: 84), which is an essential part of all communicative language use, not merely some form of compensation for deficient language abilities. It has a mediation function between the communicative message, the language competences outlined above, background knowledge, and the context of the situation, and has an assessment, a planning, and an execution component. These components operate differently depending on the situation and have an impact on all communication. In other words, Bachman models aspects of the ability for use in performance more comprehensively than anyone before him, including some cognitive capacities involved in performance, which are different from the knowledge of the language.

Skehan (1998) considers the Bachman model to be a significant advancement of our understanding of communicative competence. In comparison to previous approaches, it is more precise and complex. It describes language competence in more detail and with greater accuracy, representing its own internal organisation. It provides a more complex account of pragmatic knowledge and significantly redefines the status of strategic competence, representing more precisely the interrelationships between the different component competences. According to Skehan, most importantly, Bachman redefines the relationship between competence and performance. This relationship is characterised as a dynamic one with strategic competence as the central mediator. It is strategic competence which drives communicative language ability in language use. Much as the Bachman model has advanced our understanding of communicative competence, it has its own shortcomings. As Skehan (1998: 165) points out, the model lacks an empirically based rationale grounded in psycholinguistic mechanisms and processes by which such a model can move beyond ‘checklist’ status and make functional statements about the nature of performance and the way it is grounded in competence.

Bachman and Palmer (1996) revised and modified the original Bachman model. According to McNamara (1996) and Celce-Murcia et al. (1995), the most significant changes include the incorporation of affective, non-cognitive factors in language use, the redefinition of ‘knowledge structures’ as ‘topical knowledge’, and the reconceptualisation of strategic competence as a set of metacognitive strategies or “higher order executive processes that provide a cognitive management function in language use, as well as in other activities” (Bachman & Palmer 1996: 70). Language use is conceptualised, on the one hand, as interactions between different characteristics of the individual, including language knowledge, topical knowledge, affective schemata, and personal characteristics, and, on the other hand, as interactions of these features with the characteristics of the language use situation or test task. In actual language use, the knowledge areas and the personal characteristics are mediated by strategic competence. Again, it is this strategic competence which is central to language use. It is defined as a set of metacognitive strategies or higher-order executive processes including goal-setting (“deciding what one is going to do”), assessment (“taking stock of what is needed, what one has to work with, and how well one has done”), and planning (“deciding how to use what one has”) (Bachman & Palmer 1996: 71). Figure 2 below shows the components of language use and language test performance as conceptualised by Bachman and Palmer (1996: 63).

Figure 2:    Components of language use and language test performance (Bachman & Palmer 1996: 63)


For the present discussion, Bachman and Palmer’s model is significant in that it models aspects of ability underlying performance that are not just cognitive but affective and volitional in nature.

In summary, the major models of communicative competence in the tradition of Hymes attempt to conceptualise the components involved in language use for communicative purposes. While earlier models studiously avoided aspects of actual performance, more recent approaches have included with increasing rigour considerations of the performance dimension, such as cognitive, affective, and volitional factors. Yet such models of communicative competence alone do not seem to be rich enough to conceptualise any factor that could conceivably influence second language performance in a performance test. For one thing, they do not adequately allow for the interactive nature of performance assessment (McNamara 1996: 85). For another, they fail to represent the impact that facets of the test situation may have on the performance. Before the literature review turns to models of performance assessment, which have attempted to enrich our understanding of what is involved in an assessment situation, it discusses different approaches to describing speaking. While the models outlined above aimed to conceptualise communicative ability in general, the following subsection considers attempts to conceptualise speaking ability in particular. The focus is again on what each of these approaches has to offer for our understanding of performance.

2.2.3  Approaches to speaking

Cognitive models of speech production

One strand of theory conceptualises speaking as a cognitive process. One of the most influential models of speaking as information processing was proposed by Levelt (1989), who asserts that developing a theory of any complex cognitive skill requires a reasoned analysis of the system and its subsystems, or processing components, as well as a specification of their interaction in generating their joint product. He postulates a number of processing components, each of which produces some form of output after receiving some form of input. Firstly, the intentional act of speaking entails a number of activities, such as conceiving of an intention, selecting the relevant information, ordering this information, keeping track of what was said before, attending to one’s own production, and monitoring the content and manner of what is being said. All these mental activities form part of a processing system referred to as conceptualiser, which accesses procedural as well as declarative knowledge, including general knowledge of the world and more specific knowledge about the interactional situation, to produce a preverbal message. Planning this preverbal message takes place on a macro level, involving the elaboration of some communicative goal into subgoals and the retrieval of the information to be expressed in order to realise these subgoals, as well as on a micro level, involving the assignment of the appropriate propositional shape to each of these pieces of information and the specification of the informational perspective that will guide the addressee’s allocation of attention. This output message is in turn the input for the second processing component, the formulator. This formulating component takes up the fragmentary messages and turns them into a phonetic or articulatory plan. That is, a conceptual structure is translated into a linguistic one. This translation is conducted in two stages. First, the message is encoded grammatically, which involves accessing lemmas and syntactic building procedures. The product of this stage is a surface structure. Second, the message is encoded phonologically, the product of which is a phonetic or articulatory plan, which is not yet overt speech but an internal representation of how the planned utterance should be articulated. This output is again the input for the next processing component, the articulator. This is the stage at which internal speech is executed by the articulatory apparatus to become overt speech. The fourth and fifth processing components are part of self-monitoring, which is not unique to speaking but part of general language comprehension. An audition processing component allows speakers to listen to their own overt speech. The speech-comprehension system enables them to interpret their overt speech sounds as meaningful words and sentences. The output of this component is parsed speech, i.e. a representation of the input speech in terms of its phonological, morphological, syntactic, and semantic composition. At the same time, the speech-comprehension system allows speakers to attend to their internal speech, detecting problems before articulation or remedying self-generated form failures apparent in self-corrections. This monitoring attends to both meaning and form of internal or overt speech. As monitoring also takes place before messages are sent to the formulator, this processing component is not considered autonomous in language production, but part of the conceptualiser as well. The processing in this model is largely automatic, incremental, and parallel. That is, as soon as information is passed on to the formulator, the conceptualiser immediately continues to produce the next part rather than waiting until it has gone through the whole system. As a consequence, different parts of the same utterance will be at different stages of the processing system, and different components operate simultaneously. While the greatest attention is paid to conceptualising and some to monitoring, other components function without conscious control. Figure 3 is a visual representation of Levelt’s (1989) model of speaking as information processing.

Figure 3:    Levelt’s blueprint for the speaker (Levelt 1989: 9)


De Bot (1992) adapted this model for bilingual processing, modifying primarily the notions of the conceptualiser and the formulator. He discusses whether each of the three processes of conceptualisation, formulation, and articulation is language specific, or whether resources at any of the levels are shared across languages. He concludes that only the second of the two production phases in the conceptualiser, the microplanning, is language specific, and that formulators are language specific but draw on a single non-language-specific lexicon. In the formulator, the preverbal message is processed into a speech plan very much in the same way as in Levelt’s model. While de Bot (1992) hypothesised that the conceptualiser is partly language specific and partly language independent, he corrects this view in de Bot (2003) in the light of more recent research to expand the influence of multilingualism to the whole conceptual level. That is to say, the use of various languages seems to have an impact on processing on all levels. A special role is given to what is called the language node, which is a monitoring device that checks errors in output and is hypothesised to play a role in activating language-specific information on different levels of the process. While the most important function of the monitor in the Levelt model was to check whether the intended meaning was actually expressed, de Bot (2003) suggests that the monitor also plays a role in preparing the language production system for the use of one or more specific languages.

One of the most recent developments in the tradition of Levelt (1989) is Kormos’s (2006, 2011) bilingual speech production model. Here too, speech production comprises different encoding modules: the conceptualiser, the formulator, and the articulator. Similar to L1 processing, L2 speech production is considered to be incremental in the sense that a fragment of a module’s input can activate encoding procedures in the module. Unlike in other models, however, processing is not necessarily serial. Providing that learners have reached a particular level of proficiency and the encoding process does not need conscious control, parallel processing may take place. Furthermore, an L2 specific knowledge store of declarative knowledge is postulated, which includes those syntactic and phonological rules that are not automatised and integral to the encoding systems yet.

Such models of speech production are able to represent the processing system underlying speech production, showing how spoken language draws on linguistic systems. In this regard, they can be useful as a basis for theory-informed cognitive validation of speaking tests (Weir 2005: 15). However, they have two major weaknesses. To begin with, the staged processes including message generation, grammatical encoding, phonological encoding, and articulation are presented as relatively independent and autonomous. There is not enough flexibility in these models to allow for interactive processing at different levels. That is, the modularity and unidirectionality in these models cannot capture adequately the interdependencies that may exist between different levels of processing and activation (Ellis 1999: 24). Secondly, and more importantly, the models do not go beyond the point of utterance. Aspects of speaking performance connected to the communicative context, purpose, or interactional nature of communication do not feature in such models (Hughes 2002: 31). It seems, then, that they cannot fully account for the variation typically encountered in second language speaking performance.

Process-oriented approaches to speaking

Another, more pedagogically oriented model of speaking that has been productive in teaching and assessing second language speaking is Bygate’s (1987) model of speech as a process. The characteristics of speaking are hypothesised to derive from two kinds of conditions under which speaking takes place: processing conditions and reciprocity conditions. While the former is related to the internal conditions of speech, including, for example, the fact that speech takes place under the pressure of time, the latter refers to the relation between the speaker and the listener, involving the dimension of interpersonal interaction in conversation. Bygate starts off with a basic distinction between knowledge and skills, knowledge basically referring to a set of grammar and pronunciation rules, vocabulary, and knowledge about how they are normally used, and skills referring to the ability to use them actively. Both knowledge and skills are necessary for speaking. Furthermore, Bygate distinguishes three levels of processing: planning, selection, and production. Firstly, in an interactive speaking context, planning requires knowledge of informational and interactional routines. Informational routines are frequently recurring types of information structures such as stories, descriptions of places and people, presentation of facts, comparisons, and instructions, whereas interactional routines are routines typical of interactions, i.e. turn structures typically associated with situations such as service encounters, telephone conversations, interview situations, casual encounters, or lessons. Speakers also need to know the state of the ongoing discourse. The skills needed to use this knowledge include message planning skills (information plans and interaction plans), on the one hand, and management skills (agenda management and turn-taking), on the other. These skills enable speakers to plan their messages and interactions both in terms of content and discourse.
Secondly, at the selection level, the speakers’ knowledge of lexis, phrases, and grammar resources will determine the choice of how the plan is executed. The corresponding skills relate to the negotiation of meaning and include explicitness skills, which concern a speaker’s choice of expression in view of what the interlocutor knows, needs to know or can understand, and procedural skills, which concern a speaker’s procedures to ascertain that understanding takes place. Luoma (2004: 104) points out that together the planning and selection activities can be regarded as interactional skills since they concern the way speakers relate to others in conversation. Thirdly and lastly, at the production stage, which is closely related to the processing conditions such as real-time processing, speakers require knowledge of articulation devices, grammatical and pronunciation rules. The corresponding skills are facilitation, including, for example, simplifying structures, using ellipsis, formulaic expressions, fillers, and hesitation devices, and compensation, such as self-correction, rephrasing, repetition, and hesitation. While these three stages of processing are considered to be the same for first and second language speakers, Bygate (1987) acknowledges that learners of a language need strategies to cope with and compensate for deficiencies in their knowledge and skills. Such additional skills include achievement and reduction strategies, which enable speakers to compensate for a language gap by providing a substitute without altering the intended message or reduce the intended message so as to adapt it to the currently available language resources, respectively. Figure 4 provides a summary of Bygate’s (1987) model.

Figure 4:    A summary of oral skills (Bygate 1987: 50)


Both models of communicative competence and process-based approaches to speaking have drawn attention to the components of language ability and the processes involved. They have highlighted the fact that the processing skills required for speaking differ from those involved in listening, reading, and writing. Such approaches have proven particularly useful for learning-related assessment of speaking. Luoma (2004: 106) points out that the organisation into processes such as planning, selection, and production or interaction provides a clear basis for organising learning activities and a rationale for choosing learning and assessment tasks.

On the other hand, such models are sometimes referred to as “speaker-internal” (Luoma 2004: 104), which stresses the fact that they conceptualise speaking as an individual phenomenon. Although they do mention the interlocutor, interaction, and the pragmatic purpose of the discourse, they consider speaking as a skill that primarily resides in the realm of an individual. McNamara (1997: 447) points out that although the notion of interaction has in fact featured prominently in language testing, it has been considered from a rather one-sided perspective. He asserts that the discussion has tended to focus on interaction as a psychological category referring to different types of cognitive activity within an individual speaker as opposed to a social category related to the behavioural reciprocity as the basis for the co-construction of speaking performance. Similar to the approaches to communicative language ability outlined above, the cognitive and process-oriented models of speaking regard interaction as a cognitive ability within an individual, even when referring to social interaction. Thus, McNamara (1997) calls for a broader, more dynamic understanding of social interaction, in which strategic competence exceeds the private knowledge of an individual and gains firm ground in performance within context.

Speaking as interaction

In recent years, there has been an increasing interest in alternatives to individual-oriented and cognitively grounded theories of language (Luoma 2004: 102). Communicative competence with its focus on the skills and knowledge that an individual language user has in terms of linguistic, discourse, pragmatic, and strategic competence to communicate accurately and effectively in a second language is said to be limited in the sense that it does not take account of conceptualisations of speaking performance as a co-constructed activity (Jacoby & Ochs 1995). In response, writers have concentrated more on the social context and how communication is perceived and constructed in a given situation. Going back to the seminal work of Vygotsky (1986), sociocultural theory suggests that as thinking and action are inextricably linked to each other and as action takes place in a particular community, the social context plays an essential role in an individual’s development. In fact, a person’s development cannot be adequately understood without taking account of the external social and cultural world in which he or she interacts. Social relationships are so fundamental to an individual’s cognitive development that they become the primary object of interest. Thus, cognition should be considered a social rather than an individual concept. Language in this view is culturally mediated and learned through encounters and experiences with others.

Theoreticians with a particular interest in the social context of language use and the construction of communication in interaction with others have put forward the theory of interactional competence. It seeks to describe and explain the variation in an individual’s performance from one context to another, the sociocultural aspects of discursive performances, and the interactional processes involved in the co-construction of these discursive performances by the participants. This constructivist perspective on interaction was first referred to as interactional competence by Kramsch (1986). Although she does not provide a definition of the term, she addresses the question of what successful interaction presupposes, namely “not only a shared knowledge of the world, the reference to a common external context of communication, but also the construction of a shared internal context or ‘sphere of inter-subjectivity’ that is built through the collaborative efforts of the interactional partners” (Kramsch 1986: 367). Criticising the oversimplified view on human interaction by the proficiency movement, which can even prevent the attainment of true interactional competence, Kramsch argues that an interactionally oriented curriculum must allow for a critical and explicit reflection of the discourse parameters of language in use. That is, foreign language learning is not only about developing language but also metalanguage skills, including the ability to reflect on interactional processes, to manipulate and control contexts, and to see oneself from an outsider’s point of view (Kramsch 1986: 369).

This notion of interactional competence was further developed by He and Young (1998) and Young (2000). Considering that “abilities, actions, and activities do not belong to the individual but are jointly constructed by all participants” (He & Young 1998: 5, original emphasis), they describe the linguistic and pragmatic resources that participants contribute to an interactive practice, including (a) knowledge of rhetorical scripts, (b) knowledge of certain lexis and syntactic patterns specific to the practice, (c) knowledge of how turns are managed, (d) knowledge of topical organisation, and (e) knowledge of the means for signalling boundaries between practices and transitions within the practice itself. Unlike communicative competence, which is considered a trait or set of traits inherent in an individual, interactional competence is co-constructed by everyone involved in an interactive practice and is specific to that practice. The participants’ knowledge and interactive skills are local and practice-specific, i.e. they apply to a given interactive practice and may or may not apply in a different configuration to different practices. Yet, participants make use of the resources they have acquired in previous instances of the same practice. Therefore, “individuals do not acquire a general, practice-independent communicative competence; rather they acquire a practice-specific interactional competence by participating with more experienced others in specific interactive practices” (He & Young 1998: 7). In other words, interactional competence is not a trait of an individual, but rather something that is constructed collectively by everyone involved.

This idea of interactional competence as a feature of discursive practice rather than an individual language user’s trait was further explicated by Young (2000). Interactional competence “comprises a descriptive framework of the socio-cultural characteristics of discursive practices and the interactional processes by which discursive practices are co-constructed by participants” (Young 2000: 4). Contrasting interactional competence with models of communicative competence, which focus on an individual language user in a social context, Young argues that the theory of interactional competence centres around the constructivist joint effort in communication, characterised by four features: Firstly, rather than dealing with language ability independent of context, it is concerned with language used in specific discursive practices. Secondly, instead of focusing on a single person, it focuses on the co-construction of discursive practices by all participants. Thirdly, the theory of interactional competence involves a set of general interactional resources that participants draw on in specific ways to co-construct a discursive practice. And finally, the investigation of a given discursive practice involves both identifying the configuration of resources that form the architecture of an interactional practice and comparing this architecture with others in order to find out which resources are local to that practice and to what extent the practice shares a configuration of resources with other practices.

The interactionalist approach has, of course, implications for the way performance and language constructs are defined (Bachman 2007). In the strongest sense, the interaction itself is the construct. The performance showing the ability to engage in interaction is interaction, and the resources that individuals bring to it are local and co-constructed in the discourse. Bachman (2007: 61) admits that this perspective enriches the current conceptualisation of language ability both in terms of its central components and its relation to the context. At the same time, he argues that in the strong interactionalist approach the relationship between interaction and language ability is an unresolved issue. What language testers are interested in are potential generalisations from consistencies in performance across a range of tasks. If, however, each discursive practice is co-constructed by all participants, it is unique; and if it is unique, then there cannot be any consistencies in performances across contexts and participants. As a consequence, there is no basis for generalising about the characteristics of either the contexts or the participants. However, if there are consistencies in performance, language testing theory is unable to explain or interpret them. In contrast to the strong interactionalist view, Bachman (2007) outlines a moderate and a minimalist approach. The moderate view contends that language ability interacts with the context and is changed by it (Chalhoub-Deville 2003: 372). That is, ability and context are two separate entities, but the ability is affected by interaction. The minimalist view also sees language ability and context interacting; however, the ability is not changed by the interaction (Chapelle 1998: 45). At any rate, the discussions focusing on speaking as interaction have shown that modern oral performance assessment is inconceivable without taking proper account of interactional competence.

An action-oriented approach: the CEFR

In Europe, the most influential document dealing with language ability and use has been the Common European Framework of Reference for Languages (CEFR) (CoE 2001). It adopts an action-oriented approach and considers language users and learners to be social agents who seek to accomplish all sorts of tasks – not only language-related ones – in a given context. In this view, language use

In order to carry out communicative tasks, users engage in communicative language activities, which may or may not be interactive. If such communicative activities are interactive, the participants alternate as producers and recipients of text, either spoken or written, often with several turns. In other cases, producers are separated from recipients. Thus, speaking is a communicative event that encompasses oral production and/or spoken interaction. In oral production, the language user produces an oral text aimed at an audience, such as speeches at public meetings, university lectures or presentations. The CEFR provides illustrative proficiency scales for overall spoken production, sustained monologue (describing experience and putting a case), public announcements, and addressing audiences. In interactive activities, the language user functions as both speaker and listener with one or more interlocutors. Speakers and interlocutors construct conversational discourse conjointly through the negotiation of meaning, employing both reception and production strategies. Furthermore, cognitive and collaborative strategies to manage cooperation and interaction are employed by the participants. Examples of interactive activities listed in the CEFR include among others transactions, casual conversations, formal and informal discussions, debates, interviews or negotiations. Illustrative proficiency scales are provided for overall spoken interaction, understanding a native speaker interlocutor, conversation, informal discussion, formal discussion and meetings, goal-oriented co-operation, transactions to obtain goods and services, information exchange, and interviewing and being interviewed.

While engaging in communicative language activities, language users employ communication strategies to carry out communicative tasks. By means of such strategies, speakers mobilise and balance their resources, activate skills and procedures so as to meet the demands of communication in context, and complete the given task efficiently and successfully. The strategies include not only those to compensate for deficiencies in order to avoid breakdowns in the communication; they also refer to means to enhance the communicative effect. Illustrative scales for production strategies are provided for planning, compensating, and monitoring and repair, and there is one scale for reception strategies: identifying cues and inferring.

After delineating the communicative activities language users engage in and the communicative strategies they employ, the CEFR goes on to describe in detail the competences required to carry out tasks in communicative situations. While it acknowledges the contribution of all human competences to the ability to communicate, including, for example, knowledge of the world, sociocultural knowledge or intercultural awareness, it emphasises the significance of the specifically language-related communicative competence, which comprises linguistic, sociolinguistic, and pragmatic components. Illustrative scales include general linguistic range, vocabulary range, vocabulary control, grammatical accuracy, phonological control, orthographic control, sociolinguistic appropriateness, flexibility, turntaking, thematic development, coherence and cohesion, spoken fluency, and propositional precision.

In summary, the CEFR endorses an action-oriented approach to language use. In this view, performance is shaped by the activation of communicative – including linguistic, sociolinguistic, and pragmatic – competence and communicative strategies in the execution of various language activities, including reception, production, interaction, and mediation. The driving force in any language use situation is the task as a purposeful action necessary to achieve a desired result. The emphasis is on language as a tool that enables people to interact in social contexts. For language assessment, this implies that the assessment procedure should be performance- and task-based.

While the CEFR scales are useful instruments for harmonising language teaching and testing practices in Europe and beyond, it should be emphasised that the scales are “essentially a-theoretical” in nature (Fulcher 2003: 112). They were compiled in a psychometrically driven way on the basis of intuitive teacher judgements rather than a theory of language ability or SLA research (Hulstijn 2007: 666) – a criticism that will be taken up in subsection 3.4. Accordingly, the CEFR can at best function as a heuristic model for practitioners helping them design language tests or learning activities (Fulcher 2010: 18). It cannot, however, be considered a model in an all-encompassing sense as an abstract description of what it means to be able to communicate in another language. Similarly, it would be wrong to interpret the progression depicted in the CEFR scales as the sequence in which languages are actually acquired and naive to accept the simplistic notion of language learners “climbing the CEFR ladder” (Westhoff 2007: 678).

2.3  Models of performance assessment

The models of communicative competence and the different approaches to describing speaking have been highly influential and productive in language assessment. Not only have they enriched our understanding of the construct in speaking performance assessment, but they have also advanced our understanding of the relationship between language knowledge, the underlying ability for use, and actual performance. They have also provided a theoretical foundation for test development and validation projects. Having said that, such models and approaches are insufficient in two significant respects. For one thing, they still reflect a somewhat limited understanding of the relevant factors in a speaker’s ability for use, that is, those factors contributing to the ability to perform that are not specifically related to language (McNamara 1996: 84). For another, they focus primarily on the individual, failing to take proper account of external factors operating in the test situation that may affect performance. Thus, the theories outlined so far cannot adequately delineate the boundaries between performance, the underlying construct, how the candidate reacts to the task in a particular situation, and the relationship between the performance and the score awarded.

Speaking performance assessment is particularly problematic in this respect as it usually involves a rater making judgements about the quality of the performance in real time. Unlike reading or listening skills, which can be assessed by discrete items scored dichotomously as correct or incorrect, speaking skills are usually assessed in a communicative situation, in which an extended sample of speech is elicited from the test taker and judged concurrently by one or more raters. It is easy to see how factors other than the candidate’s language ability can influence the judgements in performance assessment, including, for example, the rater’s language background (Kobayashi 1992), individual prioritisations of language features (Cumming et al. 2002), rater training (Knoch 2011a) or rating experience (Lim 2011). Not only rater variables but many other factors of the test situation may have a negative impact on the assessment outcomes. O’Sullivan (2012: 234) emphasises the potentially adverse effects of such variables when he calls them “areas of great concern to the test writer” as they may have systematic error effects, for example, on the predictability of the task response, interlocutor effects, effects of test taker characteristics on performance, and rating scale validity and reliability. Put more technically, there is room for unwanted variance in the test scores, or as Bachman et al. (1995: 239) point out, performance testing brings with it “potential variability in tasks and rater judgements, as sources of measurement error”. The prevailing models of communicative competence outlined above are not able to predict such types of effects.

In order to overcome the insufficient treatment of such variables in the then existing models, McNamara (1996: 85) suggested a “three-pronged attack”. Firstly, continuous efforts should be made to develop a comprehensive model of communicative competence that is rich enough to explain the interaction between all parties involved and, in fact, any other variable that has the potential to shape performance in a systematic way. We need a better understanding of the significance of non-linguistic factors in performance and the impact they may have on the inferences we draw about a candidate’s proficiency. Secondly, research needs to investigate the impact that each of these variables has on the measurement. A comprehensive model can help to contextualise existing research on performance assessment and provide a theoretical framework for the formulation of hypotheses about the relationship between different variables in performance settings. Finally, decisions must be taken as to which of these variables are likely to be relevant in a particular test situation and what the practical implications are. In response to McNamara’s call for a broader understanding of performance assessment, several models have been suggested to identify and research the potential sources of variation in speaking assessment. The following subsections outline the most influential performance models.

2.3.1  McNamara (1996)

McNamara (1996) himself, drawing on Kenyon (1992), attempted to model performance and conceptualise potential influences in performance assessment. Emphasising the fact that language ability is to be assessed under performance conditions, i.e. as part of an act of communication, and the need to understand the effects of different variables on the pattern of test scores, he developed a model of proficiency and its relation to performance that takes into account potentially relevant factors. For a speaking test, McNamara (1996) suggested the model given in Figure 5 below.

Figure 5:    Variables influencing performance in a speaking test (McNamara 1996: 86)


Performance, ideally reflecting the candidate’s competencies, is placed in the centre of the model visually and conceptually. This performance, however, is influenced by a number of variables in the assessment context, indicated by arrows in Figure 5 representing the relational dimensions between dependable variables. One crucial factor influencing performance is the task. Essentially, the task is the vehicle which elicits and directs the performance. The candidates’ underlying competencies will influence the way they interact with the task in the communicative situation. The model assumes that the candidates draw on these competencies, which in turn have an impact on the task requirements and hence the performance. Other factors accounted for in the model pertain to the rating process, in which human judges rate a performance against pre-defined criteria. That is to say, the rating, most commonly expressed as a final score, can only partially qualify as a direct representation of performance. Moreover, specific to speaking tests, candidates may have to interact with one or several interlocutors, who may or may not share similar characteristics such as sex, age, educational level, etc. The relationship between interlocutor and candidate may or may not be balanced in terms of power and authority, language proficiency, or socio-economic status, depending on whether the interlocutor is another candidate (as in group-based oral assessment) or a teacher and/or examiner (as in conventional oral assessments). McNamara (1996: 86–87) emphasises that these variables, including non-linguistic factors, both cognitive and non-cognitive, must be modelled and investigated in research.

This seminal model of oral test performance is relevant to language testing for a number of reasons. Firstly, what it clearly shows is that a test score is influenced by a number of factors in addition to language ability, disproving the spurious assumption that a score is a pure index of the candidate’s underlying competence. The model has drawn attention to the fact that performance does not only depend on the ability to be measured; instead, it is a function of how the candidate’s competencies and characteristics interact with the given task characteristics. Similarly, the score is not simply the rater’s judgement of the performance; instead, it is a function of how the rater interprets the performance in relation to other factors, such as tasks and rating scales. Secondly, the model asserts the importance of contextual factors, most notably those related to the interlocutor. It allows for the interactive nature of most assessment contexts and the co-construction of meaning in such contexts. Finally, the model has served as a basis for language testing research aiming at better understanding the effects of the variables on the patterns of test scores (Berry 2007; Brown 1995; Lumley & McNamara 1995; McNamara 1996; North 1996; Purpura 1999; O’Sullivan 2006).

2.3.2  Skehan (1998, 2001)

While the Kenyon-McNamara model has brought about a deeper understanding of oral performance assessment, it has been criticised for failing to address the issue of how processing is adapted to performance conditions and for not making finer distinctions concerning the task component. Skehan (1998, 2001) therefore expanded the model in two respects in order to compensate for these weaknesses. Firstly, he argued that in performance testing not only the assessment of competences is required, but also the assessment of the ability for use:

Accordingly, the first addition to the model includes what Skehan calls the ability for use as a set of abilities “which mediate between underlying competence and actual performance conditions in systematic ways” (Skehan 2001: 169). Secondly, he argued that tasks need to be further analysed in terms of task qualities, on the one hand, and task implementation conditions, on the other. While the former refer to the task characteristics to generate the performance, the latter refer to the conditions under which a task is performed. Figure 6 is a visual representation of the extended model.

Figure 6:    Skehan’s (1998: 172) model of oral test performance


According to Skehan, exploring the different components and their influence on performance is only one side of the coin. In addition to examining these components in isolation, it is necessary to understand how they interact with each other. For example, rating scales, which are a source of test score variance rather than “neutral rulers” (Skehan 1998: 172) for measurement, can in fact have an intrusive influence if they address competing processing resources. As shown by Skehan and Foster (1997), different processing goals within performance, such as fluency, accuracy, and complexity, compete for processing resources. If the rating scale comprises each of these dimensions, then the score might well be influenced by the processing goal which the test taker prioritises at the time of the test. And if the rater in turn gives individual priority to one particular area at the expense of others, the influence might be even stronger. Similarly, task qualities and conditions may affect the performance selectively. Longer pre-task planning time, for example, leads to greater fluency (Mehnert 1998; Skehan & Foster 1997); tasks requiring differentiated outcomes result in greater complexity (Fulcher & Marquez Reiter 2003; Skehan & Foster 1997). Again, rating scales and raters may be selective in the areas they value.

2.3.3  Bachman (2002)

Bachman’s (2002) expanded model of oral test performance, given in Figure 7 below, is a further step in the development of performance assessment models. The significance of Bachman’s conception lies in the fact that it places greater emphasis on the impact of task characteristics in oral performance assessment. He criticises previous models for operationalising the notion of difficulty inadequately as an empirical artefact of test performance, either in the form of an average score on a given task or as the function of the interplay between construct and performance on the task, as opposed to a characteristic of the task itself. Noting that candidates with different competencies and abilities for use will find tasks with different qualities in different performance conditions differentially difficult, he argues that difficulty is a function of the interactions between all assessment components involved rather than a separate factor. Consequently, there should be greater emphasis on the task characteristics and their impact on the performance.

Figure 7:    Bachman’s (2002: 467) expanded model of oral test performance


2.3.4  Fulcher (2003)

Fulcher (2003: 115), finally, presents one of the most detailed models of oral test performance to date. He further refined the preceding approaches, providing a more elaborate account of the factors that directly or indirectly influence a performance. Figure 8 represents this expanded model of speaking test performance.

Probably the most significant achievement of Fulcher’s model is the crucial role assigned to construct definition at the centre of rating scale design. Together with the scoring philosophy and the orientation of the rating scale, the construct definition is considered to contribute the most to the score and its meaning. Task qualities and conditions, including task orientation, interactional relationship, goals, interlocutors, topics, situations, and additional task characteristics or conditions as required for specific contexts, continue to play a role in interpreting the score, but in comparison to Bachman’s conceptualisation, they are less prominent; they only form part of a larger system of many variables in operation. Moreover, Fulcher factors rater characteristics as well as rater training into the equation. The test taker component, finally, is considered to be shaped by individual variables such as personality, the abilities or capacities on the constructs to be tested, the ability to process in real time, and the candidates’ task-specific knowledge or skills.

Figure 8:    Fulcher’s (2003: 115) expanded model of speaking test performance



The models of oral test performance outlined above have been productive in language testing for a number of reasons. First and foremost, they acknowledge the fact that a test score is influenced by a number of factors in addition to the candidate’s ability, which means that the score is not a direct manifestation of the candidate’s underlying competence. Secondly, the models assert the importance of the contextual factors of assessment, taking account of the fact that assessment always takes place in a particular setting at a particular time, typically involving a number of participants. Thirdly, they allow for the interactive nature of most assessment contexts and even for the idea that meaning is co-constructed by the speakers in such contexts, which according to McNamara (1996: 85) has been insufficiently addressed within language testing. Finally, all expansions of the basic model have addressed McNamara’s (1996: 85) asserted need for “a model that is rich enough for us to conceptualise any issue we might think is potentially relevant to understanding second language performance”. Having said that, all writers (Fulcher 2003: 114; McNamara 1996: 88; Skehan 2001: 169) note that their models remain provisional and may be expanded and refined in the light of future research.

Studies aiming to better understand the effects of the variables identified in these models have proliferated since. Several authors have attempted to shed light on the impact that individual variables may have on patterns of test scores. A number of these studies have investigated observable test-taker characteristics, including gender, age, background, native language, and social status (Ockey 2009; O’Sullivan 2002), while others have focused on psychological and cognitive characteristics (Berry 2007; Norton 2005). Yet others have investigated the impact of the task on oral performance assessment (Skehan 2001; Wigglesworth 2000; Yuan & Ellis 2003). Research into interlocutor effects includes Brown (2003), O’Loughlin (2002), and O’Sullivan (2002, 2006). Rater performance, finally, was investigated by McNamara and Lumley (1997), O’Sullivan (2002), and Wigglesworth (1993). While there has been considerable research into the effects of person-related and task-related variables in performance assessment, empirical research on the rating scale component is rather scarce although it is dealt with conceptually in the literature. The following subsection elaborates on the significance of the rating scale component in performance assessment.

2.4  Rating scales in performance assessment

In all the performance models outlined above, rating scales and scale criteria constitute crucial factors. The amount of importance they attach to rating scales and the ways rating scale effects have been conceptualised, however, vary from model to model. While in his theoretical considerations McNamara does not further expand on the potential impact of rating scales or scale criteria, he restates the significance of rating scales in the context of Rasch-based rating scale validation, noting that rating instruments are particularly significant in certain types of performance assessment depending on the uses to which they are put (McNamara 1996: 182). Skehan, by comparison, is a lot more explicit about the rating scale component, pointing out that the test score is most immediately influenced by the rating procedures. Not only will the spoken performance have to be judged by human raters; in fact, it “will be filtered through a rating scale” (Skehan 2001: 168), which can differ from other scales in its characteristics and purposes. Therefore, the score assigned to a particular performance may be confounded by the rating scale and may not constitute a direct representation of a candidate’s performance. Skehan (1998: 172) emphasises that “rating scales are not ‘neutral rulers’, used in a simple manner to provide a measure of performance – rather, they can be an intrusion, a source of variance for scores which are assigned”. In other words, the score awarded to a specific performance may be biased and limited as a result of the rating criteria and instruments being used.

At the same time, possible interaction effects between the rating scale and other variables need to be recognised. As mentioned above, attentional capacities of second language users are limited, and processing goals within a performance may enter into competition. That is, constructs such as fluency, accuracy, and complexity may compete for processing resources, and test takers possibly exhibit trade-off effects between different aspects of performance. If the rating scale used in a particular test situation has a disposition to one of these aspects, then the final test score might be affected by the processing goals the candidate is pursuing in the test situation. In addition, if a rater prioritises one of these aspects over another one in the rating process, this may further influence the test score.

There may also be interaction effects between some task characteristics and the rating scales. Certain task qualities may predispose the test taker to particular aspects of performance when, for instance, tasks with differentiated outcomes result in greater complexity. Similarly, the given task conditions may have a selective impact on performance when, for example, an increased amount of planning time leads to greater fluency. Again, if the rating scale has a disposition towards particular aspects of performance, then it is obvious that the choice of task is not a neutral issue, but may influence performance and hence the performance rating. Skehan (1998: 173) concludes that a performance dimension by rating scale interaction needs to be factored in. It is important to understand the rating scale component, interaction effects, and their influence on test scores.
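The interaction between a candidate's processing priorities and the disposition of a rating scale can be illustrated with a small worked example. The following is only a sketch under strong assumptions that are not part of the models discussed: the band scores, the candidate labels, and the idea of expressing a scale's disposition as numerical weights (`fluency_leaning`, `accuracy_leaning`) are all hypothetical.

```python
# Hypothetical band scores (1-5) for two candidates with different priorities.
candidates = {
    "A": {"fluency": 5, "accuracy": 3, "complexity": 3},  # prioritises fluency
    "B": {"fluency": 3, "accuracy": 5, "complexity": 3},  # prioritises accuracy
}

# Two hypothetical scales that differ only in how strongly they lean
# towards one aspect of performance (weights are an assumption).
fluency_leaning = {"fluency": 0.5, "accuracy": 0.25, "complexity": 0.25}
accuracy_leaning = {"fluency": 0.25, "accuracy": 0.5, "complexity": 0.25}


def weighted_score(profile, weights):
    """Combine a candidate's criterion scores using the scale's weights."""
    return sum(profile[criterion] * w for criterion, w in weights.items())


def rank(weights):
    """Order the candidates from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


print(rank(fluency_leaning))   # candidate A is ranked first
print(rank(accuracy_leaning))  # candidate B is ranked first
```

With identical performances, the ranking of the two candidates reverses depending on which aspect the scale favours, which is one way of picturing why the scale cannot be treated as a neutral ruler.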

Fulcher’s model (2003: 114) represents the most comprehensive attempt to consider the influence of rating scales in speaking test performance. In this model, rating scales are implicated in at least three major ways. Firstly, construct definition is placed in a central position in rating scale design. The specific understanding of what constructs are being tested and the inferences being drawn from the scores, which in turn influence the kinds of decisions made about the candidate, will govern the rating scale design and the formulation of band descriptors. Fulcher points out that this central importance of construct definition in rating scale design is not just happenstance; in fact, understanding the construct and the relationship between score meaning and construct definition is at the very centre of evaluating whether a test is useful for a given purpose. Secondly, Fulcher acknowledges that the nature and orientation of the rating scale as well as the scoring philosophy underlying it have an impact on the score and its meaning. Thirdly, and similar to Skehan (1998: 172), Fulcher (2003: 114) acknowledges an interaction effect between the rating scale and a candidate’s performance, which in turn affects the score and any kind of inferences made about the candidate. In this view, the rating scale and the underlying scoring theory contribute the most to the score and its meaning, and the effects of task characteristics are secondary to the effects of the rating criteria. It can be concluded that in Fulcher’s model rating scales are being considered as operationalisations of the test construct and the score meaning. Even more to the point, rating scales and their interpretations “act as de facto test constructs” in performance assessment (McNamara et al. 2002: 229).

Albeit in the context of diagnostic writing assessment, Knoch (2009: 280) expanded Fulcher’s conceptualisation of the rating scale component in two ways. Firstly, her empirical findings suggest that the specific method employed in the scale development process has a direct influence on the rating scale. The choice and nature of the scale development method is in turn informed by the orientation of the scale and the scoring philosophy underlying it. Scale developers may take a number of methodological decisions in the construction process on the basis of the scoring philosophy they espouse. At the same time, the scoring philosophy may be changed as a consequence of the resulting rating scale. In view of this interrelation, scale development methods were added as a new variable to the model. Secondly, Knoch considers the relationship between rating scales and construct definition as reciprocal. Drawing on Fulcher (1996), who acknowledges the impact that the construct definition invariably has on rating scale design, Knoch (2009: 281) goes one step further and expressly acknowledges the reciprocal nature of the relationship, highlighting the impact a rating scale may have on the construct definition.

While the performance models outlined above differ in the extent to which they elaborate on the potential influence of rating scales and scale criteria on test scores, they all agree that the scales used in performance assessment are not neutral factors in the rating process but may exert a significant influence on the scores awarded. In addition, interaction effects between scales and other components in performance assessment may have a further impact. It can therefore be concluded that we need to understand the rating scale component more clearly. Indeed, the performance models presented in this chapter have provided a basis for research to investigate the effects of rating scales on the patterns of test scores. However, what the models to date have failed to do is to take validation procedures into consideration. While Knoch (2009) has included scale development methods as an additional variable, no one has modelled the impact of validation procedures in performance assessment as yet. The performance models have neglected the potential impact that validation methods may have on rating scales and band descriptors and with them indirectly on patterns of test scores. One of the purposes of this study is to investigate the potential impact that different validation procedures may have on the rating instrument and whether validation methods should be considered a relevant variable in performance assessment. Before the focus of the literature review turns to rating scale validation, a few general points about rating scales will be addressed, including common characteristics and types, theoretical and methodological issues in rating scale development, and the controversy surrounding rating scales in SLA research.

3  Rating scales

3.1  General characteristics

The past few decades have seen a tremendous increase in the number of language proficiency scales developed for various purposes. North and Schneider (1998: 217) point out that a general trend towards more transparency and comparability in educational contexts and the movement towards greater international integration, which entails the pragmatic requirement to define levels of attainment in language learning, have led to a proliferation of language proficiency scales. While some twenty or thirty years ago, most scales were directly or indirectly related to the US Foreign Service Institute (FSI) scale, an intuitive six-band holistic rating scale originally developed for military purposes, or its succeeding generations of scales, including the Interagency Language Roundtable (ILR) scale or the scale developed by the American Council on the Teaching of Foreign Languages (ACTFL), today a large number of scales exist that are independent of the FSI approach. Examples include the Eurocentres Scale of Language Proficiency (North 1991, 1993), the Finnish Scale of Language Proficiency (Luoma 1993), and the Association of Language Testers in Europe Framework (ALTE 2006).

The different backgrounds and purposes of such scales are reflected in the labels given to them. Alderson (1991: 71), for example, lists the terms “band scores, band scales, profile band, proficiency levels, proficiency scales, proficiency ratings” as labels typically given to such scales. While the terminology may vary, the scales all represent an attempt at describing an underlying hierarchical structure of discernible levels of language proficiency. In Galloway’s (1987: 27) words, they are “a hierarchical sequence of performances ranges”; according to Trim (1978: 6), they are “characteristic profiles of the kinds and levels of performance which can be expected of representative learners at different stages”. Davies et al. (1999: 153–154) define a proficiency scale as

a scale for the description of language proficiency consisting of a series of constructed levels against which a language learner’s performance is judged. Like a test, a proficiency (rating) scale provides an operational definition of a linguistic construct such as proficiency.

North (2000: 13–17) outlines three types of origins of language proficiency scales. The first type derives from the definition of examination levels. Formal examinations often provide scales of language proficiency by defining content and performance specifications at different levels in ascending order. These specifications in effect constitute a scale of language proficiency. Furthermore, large examination institutes sometimes present an existing suite of examinations in ascending order of proficiency, the University of Cambridge ESOL examinations being a case in point. The content specifications of the Cambridge ESOL General English exams, including the Key English Test (KET), the Preliminary English Test (PET), the First Certificate in English (FCE), the Certificate in Advanced English (CAE), and the Certificate of Proficiency in English (CPE), can thus be considered a hierarchical scale of increasing language proficiency. Secondly, the definition of stages of attainment is another origin of scales of language proficiency. Such stages of attainment are usually defined as part of a framework of educational objectives, course curricula or assessment procedures. They are typically holistic in the sense that they define general outcomes at different stages of an educational system or process, both in terms of degrees of skills in performance and in terms of the type of language that users can master at each level. Finally, and most commonly, current scales of language proficiency are in fact rating scales used for assigning a grade or score to a candidate in a particular test situation, most notably in performance tests of productive language skills. In contrast to limited production responses, which can be readily assessed by a dichotomous scale as either right or wrong, extended production responses in speaking and writing cannot be classified in this binary way.
Instead, raters judge the quality of the response in terms of levels of proficiency by means of a multi-level rating scale. It seems worth pointing out that such scales are not in themselves test instruments but they need to be used in connection with tests geared towards the particular test purpose. In order to ensure a minimum degree of reliability of the measure across different test occasions, raters are normally subjected to comprehensive training phases.

Typically, a rating scale has a horizontal and a vertical dimension. The former commonly characterises what language users can do with the language in terms of tasks and functions. It encompasses the different traits or key criteria to be observed and assessed, including, for example, lexical and grammatical range and accuracy, fluency, coherence and cohesion. The vertical dimension stretches along a continuum of performance quality, describing a series of ascending reference levels that indicate qualitative differences in the performance in relation to the traits or key criteria, sometimes accompanied by numerical labels of measurement. The vertical dimension reflects the extent to which the key criteria of the performance have been demonstrated or mastered by the language user. Unlike the first two types of proficiency scales mentioned above, rating scales do not normally cover the full spectrum of levels from zero mastery to full proficiency, but a specific range between these two extreme points.

3.2  Types of rating scales

Rating scales can be classified into different types. The distinctions may depend on the nature of the underlying construct, the primary function or purpose of the scales, the target user, or the scoring approach embodied in the scales. Bachman (1990: 325–330), for example, distinguishes two types of rating scales, reflecting either a real-life or an interactional/ability approach to defining language ability. The basic difference between these two categories is the underlying conceptualisation of language ability and authenticity. While the former attempts to capture and assess what a language user at a particular level can do in the real world, the latter focuses on the description of a particular test performance. Level descriptors of scales embracing a real-life approach do not make a distinction between the ability to be measured and the characteristics of the context of performance. That is, tests employing the real-life approach would build on tasks typically encountered in the target language use domain. The ACTFL proficiency guidelines (ACTFL 2012: 5) are a case in point. The superior level in the ACTFL scale, for instance, is characterised by the ability to participate effectively in conversations on a range of topics, including social and political issues, in formal and informal settings, and present and support their opinions, using a range of interactive and discourse strategies. The guidelines thus refer to specific language use contexts, topics, and functions encountered in the real world. Level descriptors of interactional/ability-type rating scales, on the other hand, describe proficiency without reference to specific contextual features. Instead, they rest on the assumption that it is possible to extrapolate from test scores to the behaviour in the real world.
Language proficiency is described in terms of component abilities, as is done, for example, in the Bachman and Palmer (1982) scale, which includes main scales for grammatical, pragmatic, and sociolinguistic competence, and sub-scales for the various aspects defined in their framework of communicative language ability.

Another typology of scales was proposed by Alderson (1991), who classifies proficiency scales in terms of the main purposes they may serve and the different groups of scale users. Aiming to disambiguate the use of rating scales, Alderson distinguishes between ‘user-oriented’, ‘assessor-oriented’, and ‘constructor-oriented’ scales of language proficiency. User-oriented scales serve to describe levels of performance so that test users, including, for example, employers or admission officers, can interpret the test results more accurately. The main function of such scales is to report information about typical or likely behaviours of candidates at a given level in addition to simply reporting a numerical score, thereby reducing the risk of spurious accuracy of raw, percentage, percentile, or standardised scores. Assessor-oriented scales aim at guiding assessors who are rating language performances during language tests. Such scales function as a common yardstick for human judges involved in the rating process in order to ensure reliability and validity of the scores. According to Alderson, a secondary purpose of such scales is to determine the nature of the test tasks used to elicit the language for assessment. Thus, the scales not only guide the rating process but also the kind of performance to be elicited. Finally, constructor-oriented scales serve to provide guidance for test constructors. Such scales constitute a set of specifications, stating the kind of texts, tasks, and items suitable for a particular test taker at a given level. They can thus help the test developers to decide which items and tasks should be included in a particular test instrument. Alderson (1991: 74) emphasises that all groups of scale users need to be clear about these purposes and that problems may arise if the three functions are obscure.

Building on Alderson’s typology, Pollitt and Murray (1996) propose a further function of scales, which can be referred to as diagnosis-oriented. Although many assessor-oriented instruments stipulate what aspects of performance assessors should concentrate on, Pollitt and Murray found that raters actually focus on different qualities at different levels, and lengthy descriptions of aspects which are not salient at a particular level may, in fact, not be very assessor-oriented. On the contrary, they may contain information that is not relevant for assessment, complicating rather than facilitating the rating process. Having said that, such descriptive schemes may have a diagnostic function in that they can help to give improved feedback on the performance.

While the two categorisations outlined above are distinguished by the conceptualisation of proficiency and the primary scale purpose, the following categorisation is based on the underlying approach to scoring. Scales belonging to this group differ in the way they represent claims about the relationship between the observable behaviour and the underlying constructs and domains. Holistic scales, also referred to as global or unitary scales, describe a performance holistically according to its overall properties rather than singling out particular features of a performance and providing information on these features separately. Hamp-Lyons (1991) divides holistic assessment into three categories. In holistic scoring, a single score encapsulating the overall quality of the whole performance is awarded. This approach considers a particular performance as the observable manifestation of a unitary construct without taking account of the components that make up that construct. In primary trait scoring, a single score is awarded on each individual task based on the one trait of the performance deemed most important, such as vocabulary or fluency. That is, if more than one task type is to be assessed, a number of rating scales will be required. This approach assumes that the construct to be tested depends on the context in which it occurs and that careful explicit statements have to be made connecting the observable behaviour and claims about the construct. Lastly, in multiple-trait scoring, several scores are allocated to each performance, with each score representing the quality of a particular feature of the performance or of the construct underlying the performance. That is, each score represents a separate claim about the relationship between the multiple constructs and the manifestations thereof in observable behaviour. This final category is similar to analytic scoring. Here too, a separate score is awarded for different aspects of the performance.
Sometimes used interchangeably with the term ‘multiple-trait scoring’, analytic scoring at other times requires the counting of observable occurrences, such as errors, in order to arrive at a final score. Analytic scales provide descriptions of a number of performance features separately. In speaking tests, for example, commonly used categories include pronunciation, fluency, accuracy, and appropriateness.
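The difference between these scoring approaches can be made concrete by looking at what a rater would record for a single performance. The following is a minimal sketch rather than a description of any existing instrument: the task labels, the band values, and the helper `profile_mean` are hypothetical, and the four analytic criteria are simply the categories named above.

```python
from statistics import mean

# Hypothetical ratings for one speaking performance (band values invented).

# Holistic scoring: a single score for the whole performance.
holistic_rating = 4

# Primary trait scoring: one score per task, each on the single trait
# deemed most important for that task.
primary_trait_ratings = {
    "task_1": ("fluency", 4),
    "task_2": ("vocabulary", 3),
}

# Multiple-trait / analytic scoring: several scores per performance,
# one for each criterion.
analytic_ratings = {
    "pronunciation": 4,
    "fluency": 5,
    "accuracy": 3,
    "appropriateness": 4,
}


def profile_mean(ratings):
    """Collapse an analytic profile into one number (one possible choice)."""
    return mean(ratings.values())


# The criterion with the lowest band in the analytic profile.
weakest = min(analytic_ratings, key=analytic_ratings.get)

print(profile_mean(analytic_ratings))
print(weakest)
```

In this invented case the averaged profile equals the holistic score, yet only the analytic record shows that accuracy lags behind the other criteria, which is the kind of information a single overall score cannot report.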

While there is considerable overlap between the various types of scales, the choice will ultimately depend on the purpose of the test. The depth of information provided by the band descriptors and the complexity of the criteria included in the scale will be a function of the intended purposes. If, for example, a score is meant to support a pass-fail decision, then a single variable representing overall proficiency might suffice, whereas a formative period of language learning would call for more detailed information and feedback about particular aspects of language ability. The certification of different aspects of oral proficiency at the end of the BA programmes at Austrian English departments calls for analytic rating scales. While this subsection has outlined types and general characteristics of rating scales, the following subsection addresses some basic considerations in rating scale design as far as they were relevant for the development of the ELTT scales.

3.3  Theoretical and methodological concepts in rating scale development

Approaches to rating scale design have traditionally been classified in different ways. While Fulcher (2003: 92) distinguishes between intuitive and empirical methods based on whether or not empirical data is used to generate the scales, Luoma (2004: 83–86), following the methodology set out in the CEFR (CoE 2001: 207–211), distinguishes three categories comprising intuitive, qualitative, and quantitative methods. Fulcher et al. (2011: 7–9) distinguish between measurement-driven and performance data-based methods of rating scale development. While the former involve designing a scale on the basis of a measurement model, which determines what is and what is not included in the final version, the latter incorporate performance data into the development process, either in the form of detailed performance descriptions or reference points to establish differences between scale levels. The various approaches to rating scale development are summarised in Figure 9 below. Although this overview seems to suggest that the three approaches are clearly separate, they are in fact often combined in the scale development process, resulting in hybrids of different methods.

Figure 9:    A framework for describing approaches to rating scale development


3.3.1  Intuitive approaches

Most operational rating scales have been developed intuitively by a priori methods, especially in low-stakes classroom contexts, probably because this approach is the most practical one. It is intuitive in the sense that it is based on expertise and experience; it does not involve any systematic collection and analysis of empirical data, nor does it draw on a particular theory or model of language ability. Instead, the scales are generated by the principled application of professional knowledge and skill. In practice, a qualified language teacher or test developer selects rating scale criteria and writes level descriptors, with or without consulting relevant source material such as existing scales or teaching syllabi. In addition, the scale writer may conduct a needs analysis of the intended target population. While in low-stakes testing situations such scales may be developed by an individual expert, in high-stakes contexts more often than not the construction process involves a committee of experts. The expert committee may consist of a small development team commissioned to draft a scale and a larger group of consultants providing feedback on the draft versions. Experience is particularly important in the revision and modification of intuitively written scale descriptors. By applying such scales operationally, the users develop an ‘in-house’ interpretation of the scale levels and refine the level descriptors in the light of their experience. Rating scales in the FSI tradition, such as the ACTFL scale described by Lowe (1987), are cases in point.

3.3.2  Theory-based approaches

Another approach to scale development can be referred to as theory-based. Rather than expertise or experience, a theory or a theoretical model or framework is used as the starting point for scale construction. Knoch (2011b) suggests four types of theories or models that could be used as a basis for rating scale development: (a) the Four Skills Model proposed by Lado (1961) and Carroll (1968); (b) models of communicative competence, such as those outlined in subsection 2.2 above; (c) theories or models of individual skills, such as Bygate’s (1987) process-oriented model of speaking; and (d) theories of rater decision-making, such as the descriptive framework proposed by Cumming et al. (2001, 2002). In addition, theoretically based rating scales can be derived from or informed by theories of SLA, such as Pienemann et al. (1988). Examples of theory-based rating scales include Hawkey and Barker (2004), McKay (1995), and Milanovic et al. (1996), who have chosen models of communicative competence as the conceptual basis for their rating scales. To the extent that these scales operationalise a theory or theoretical framework, they are considered to be representative of a theory-based approach to scale development. A number of authors have stressed the importance of a theoretical underpinning if rating scales are to allow valid judgements. Lantolf and Frawley (1985), for example, argue that the validity of a rating scale will be limited if the underlying framework does not take linguistic theory and research into consideration. Similarly, McNamara (1996: 49) notes that “an atheoretical approach to rating scale design in fact provides an inadequate basis for practice”.

3.3.3  Empirical approaches

The third approach to scale construction is based on the systematic analysis and interpretation of empirical data, qualitative and/or quantitative. Both qualitative and quantitative methods may use either band descriptors or performance samples as a starting point. While qualitative methods involve the intuitive selection, preparation, and interpretation of rating scale descriptors or sample performances, quantitative methods are “measurement-driven” (Fulcher et al. 2011: 7) in the sense that a statistical model shapes the appearance of the scale. One type of qualitative method, for example, aims to identify key concepts in draft descriptors and revise their formulation. For this method, a draft scale is split into individual descriptors, and a group of informants is asked to re-establish the original order. If differences occur between the reconstructed order and the intended order, informants try to identify the key features that were decisive for this difference. Similarly, informants can be asked to sort descriptors into piles according to the categories they seem to describe and/or according to the levels they are believed to represent. Additional comments on or modifications of the descriptors by the informant group will help to hone the wording of the scale descriptors. This is a technique first introduced by Smith and Kendall (1963) and later used by North (1996, 2000) to develop and edit the pool of descriptors that formed the basis of the set of illustrative scales of the CEFR (CoE 2001). These latter two methods have in common that they take descriptors as the starting point for scale construction. That is, scale developers write draft descriptors using intuitive or theory-based methods or collect and possibly edit descriptors from existing scales as input for the qualitative phase of scale development.

Other qualitative methods take actual performance samples as the starting point for scale construction and input for the qualitative phase. For example, a group of informants may be asked to tag performances with descriptors that best represent them and check whether the descriptors are coherent with what actually occurs in practice. Then, performances representative of each level can be investigated for the key features deemed characteristic of a particular level. These features are then incorporated into the descriptor wording. The rating scales developed by Shohamy et al. (1992) in a project to investigate the reliability of raters who differ in their professional backgrounds and amount of training are a case in point.

Yet another qualitative method is to identify the primary trait that determines the hierarchy of sample performances. First, expert informants are asked to rank order a number of performances. Then they identify and describe the criteria that were decisive in ordering the performances in the way they did. The end product of this process is a description of the primary trait or construct that is salient at a particular level. Upshur and Turner (1995) propose a variant of the primary trait method, in which sample performances are first divided into better and poorer ones. Then in a discussion process, the key features characterising the boundaries between levels are identified and subsequently turned into short yes/no criterion questions, resulting in a tree of binary choices that lead the raters to a score. This method is referred to as the empirically derived, binary-choice, boundary definition method (Upshur & Turner 1995).
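
The logic of such a binary-choice instrument can be illustrated with a minimal sketch in Python. The criterion questions, their order, and the six-point score range below are hypothetical and serve only to show how a sequence of yes/no decisions leads a rater to a score; they are not taken from Upshur and Turner (1995).

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Question:
    """One yes/no criterion question marking a boundary between levels."""
    key: str                          # short identifier used in the answer sheet
    text: str                         # wording shown to the rater (hypothetical)
    if_yes: Union["Question", int]    # next question, or a final score
    if_no: Union["Question", int]


def rate(node: Union[Question, int], answers: Dict[str, bool]) -> int:
    """Follow the rater's yes/no answers down the tree until a score is reached."""
    while isinstance(node, Question):
        node = node.if_yes if answers[node.key] else node.if_no
    return node


# Hypothetical scale: the first question splits better from poorer performances,
# the following questions define the boundaries within each half.
UPPER = Question(
    "elaborated", "Is the message elaborated with supporting detail?",
    if_yes=Question("smooth", "Is delivery smooth, with few hesitations?", 6, 5),
    if_no=4,
)
LOWER = Question(
    "comprehensible", "Is the message comprehensible?",
    if_yes=Question("isolated_words", "Does the speaker rely on isolated words?", 2, 3),
    if_no=1,
)
ROOT = Question("better", "Is the performance among the better ones?", UPPER, LOWER)

# A rater who answers yes, yes, no ends up at the boundary between 5 and 6.
example_score = rate(ROOT, {"better": True, "elaborated": True, "smooth": False})
```

Because every score is reached through explicit boundary questions, the rater never has to match a performance against a full level descriptor, which is the point of the boundary definition approach.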

Another technique requires raters to discuss pairs of performances, make a comparative judgement as to which of the two samples is the better one, and provide justifications for their judgements. In this way, metalanguage is elicited from the raters that can be used to deduce the categories and salient features at various levels within each category. This approach was adopted by Pollitt and Murray (1996) in a study to investigate what raters focus on at different levels of proficiency.

More recently, Fulcher et al. (2011) suggested how the analysis of performance data from service encounters can be used to develop a new type of scoring instrument referred to as ‘Performance Decision Tree’ (PDT). First, the criterial discourse and pragmatic elements in effective service encounters were identified on the basis of authentic data and the relevant literature. Then, these elements were transformed into the PDT, which is essentially a boundary choice approach involving a number of yes/no decisions in relation to several discourse and pragmatic features. The authors argue that this “performance data-driven approach”, which incorporates features of empirically derived, binary-choice, boundary definition scales, has the advantage of generating richer and more meaningful descriptions of performance, which in turn have the potential to generate more reasonable inferences from score meaning to performance in the specified domain (Fulcher et al. 2011: 23).

While qualitative methods do not normally involve statistical analyses, quantitative methods employ procedures to quantify material that may or may not have undergone a qualitative phase beforehand. Needless to say, such quantitative approaches to scale production require a fair amount of technical and statistical expertise. As with qualitative methods, here too, the starting point may either be a set of (draft) descriptors for the categories at hand or a number of performance samples that have been assessed by raters. The latter was the case in a study by Fulcher (1993, 1996), who used a systematic research design to investigate real performances as the starting point for scale development. Focusing on fluency, he first employed discourse analysis to identify and count the occurrences of particular features of the performance. In a second step, multiple regression was used to determine which of these features had the greatest impact on the scores awarded by the raters. The features isolated in this way were finally incorporated into the scale descriptors.
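
The regression step of such a design can be sketched in a few lines of Python. The sketch is illustrative only: the two features (filled pauses and repairs), the counts, and the scores are invented, and the data are constructed so that the fit is exact and the result can be checked by hand. It does not reproduce the analysis reported by Fulcher (1993, 1996).

```python
def fit_ols(rows, scores):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows   -- one list of feature counts per performance
    scores -- the rating awarded to each performance
    Returns [intercept, coef_1, ..., coef_k].
    Assumes the features are not perfectly collinear.
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, scores))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Invented data: [filled pauses, repairs] counted in six performances,
# with scores generated as 6 - 0.3 * pauses - 0.2 * repairs.
counts = [[2, 1], [4, 0], [6, 3], [8, 2], [10, 5], [3, 4]]
scores = [5.2, 4.8, 3.6, 3.2, 2.0, 4.3]

intercept, b_pauses, b_repairs = fit_ols(counts, scores)
```

In a real study the raw coefficients would not suffice to decide which feature has the greatest impact; standardised coefficients and significance tests from a statistics package would be needed for that judgement.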

Another quantitative approach to rating scale construction was adopted by Chalhoub-Deville (1995), who aimed to derive criteria underlying spoken performances across different tests. In her study, raters were asked to assess a number of performances holistically. The holistic scores were analysed by means of multidimensional scaling and linear regression with a view to identifying the features that were most decisive in determining the score.

Lastly, the most complex quantitative approaches to scale construction involve item response theory (IRT), which is based on probability theory and can be used to calibrate proficiency descriptors onto the same scale just as items can be calibrated in an item bank. The scaling approach employing IRT-based techniques is associated mainly with the work of North (1995, 1996, 2000) in connection with the development of a common European framework for reporting language competence. Variants of this approach can be used not only for scale construction but also for post hoc scale validation, for example, to find out how raters actually use a rating scale in practice. A more detailed account of the principles of this approach and its use in scale validation is given later on in subsections 4.2 and 7.2.
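
The idea of calibrating descriptors and learners onto a common scale can be illustrated with the dichotomous Rasch model, one of the simplest members of the IRT family. The notation is a conventional choice made here for illustration: θ_n stands for the ability of learner n and b_i for the difficulty of descriptor i, both expressed in logits. The formula is meant as a sketch of the general principle, not as a specification of the models used in the studies cited above.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

When ability and difficulty coincide (θ_n = b_i), the probability that the learner is judged able to do what the descriptor describes is 0.5; it rises towards 1 as ability exceeds difficulty and falls towards 0 in the opposite case. Because both parameters are located on the same logit scale, descriptors can be ordered by difficulty in the same way as items in an item bank.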

3.3.4  Triangulation of approaches

All three main approaches to scale construction have their advantages and disadvantages. While intuitive methods are practical, relatively inexpensive, and less time-consuming, they have been criticised on the grounds that they often result in vaguely defined band descriptors. The descriptions are often too general or meaningful only to experienced raters who have been socialised into a common interpretation of the scale through extensive rater training and operational scale use. Outside the enclave of the in-house system, however, such scales may have little currency or fail to produce meaningful and reliable results (Alderson 1991). Another criticism of intuitively developed rating scales is that the relation between what is described in the descriptors and what happens in real performances is often very faint. In connection with the IELTS rating scales, Alderson (1991: 75) noted that there was not enough correspondence between the performance samples elicited by the test tasks and the descriptors in the rating scales. Furthermore, intuitively developed scales have been criticised for depending too heavily on the concept of the well-educated native speaker. The top levels of such rating scales are often based on the abstraction of an educated native speaker, and the lower bands are typically defined in relation to the top levels. However, a number of writers have demonstrated that native speakers themselves show tremendous variation in ability so that the notion of the native speaker as an ideal language user remains elusive and thus problematic as a yardstick in rating scales (Bachman & Savignon 1986; Davies 1990; Lantolf & Frawley 1985). Arguably, the most telling criticism of intuitively developed rating scales refers to their relationship to findings from SLA research.
Such scales tend to be functional, describing what users of the language can do, rather than developmental, representing acquisitional patterns of language development (Montee & Malone 2014: 853). Intuitive scales assume that language users develop along the continuum from the ability to use language as described in the lowest band to the ability to use language as described in the highest band of the scale, acquiring the features included in the scale in a linear, cumulative way. Pienemann et al. (1988), however, have argued that there is no universal developmental schedule and that any pattern of development may not be assumed but must be theoretically and empirically verified. This criticism is of crucial importance to the present study and will be reiterated in more detail later on in subsection 3.4.

Some of the weaknesses of intuitively developed rating scales can be overcome by theory-based and empirical approaches to rating scale design. Explicit theoretical models of abilities in performance, for example, would satisfy McNamara’s (1996: 50) call for greater “definitional clarity”, which is imperative if the construct validity of performance-based proficiency tests reflecting the notion of communicative competence is to be more than just a claim. Basing a rating scale on an explicit theory or model of language ability would lead to more precision and meaningfulness of the scores beyond the immediate testing context. As Knoch (2011b: 85) points out, such models are generic in nature and do not depend on the specific context, which makes the results more generalisable and transferable across task types. The problem is, however, that to date no one model can explain and predict performances satisfactorily and completely. Although McNamara (1996: 85) considers the completeness, adequacy, and coherence of theoretical models as crucial, he admits with Lantolf and Frawley (1988, cited in McNamara 1996: 51) that “[a] review of the recent literature on proficiency and communicative competence demonstrates quite clearly that there is nothing even approaching a reasonable and unified theory of proficiency”. Although attempts to develop richer models of communicative competence, proficiency, and performance have proliferated since, this observation seems to hold as true today as it did some twenty years ago. Knoch (2011b: 85) notes that although models of communicative competence have been used in rating scale design, they cannot provide a sufficient basis for test development. For one thing, it is not clear if the various components of the model can actually be isolated and operationalised adequately in a test, nor is it certain how individual aspects of the model relate to and interact with each other.
For another thing, the relationship between such models and actual performance may not be clear enough since some of the models used in test construction have not been empirically validated to a sufficient extent. Models of communicative competence, in particular, lack precision in terms of how the underlying competence is put into use. After all, such models have been developed to conceptualise underlying competence rather than actual performance. That is, no single model by itself can describe language performance adequately. Furthermore, theories of SLA are often too complex to lend themselves to straightforward application in rating scale development. Scales that truly reflect acquisitional patterns of language development may be overly general, failing to reflect adequately the performances elicited by the test tasks or the kind of real-life tasks a candidate is meant to perform (Montee & Malone 2014: 853).

Empirical approaches to rating scale design have the advantage that they can lead to very concrete descriptions based on data. In the context of producing a Common European Framework for languages (CoE 1996, 2001; Trim 1997), for example, the measurement approach seemed eminently suitable for generating a scale that could be used to assess learners of various linguistic backgrounds in various geographical and educational settings. Fulcher et al. (2011: 23), however, argue that performance data-based approaches are better able to produce meaningful or “thick” descriptions of language use as a basis for rating scale construction. Although expensive, time-consuming, and often not feasible for practical reasons, a combination of empirical methods, including measurement-driven and performance data-based methods, probably bears the greatest potential to generate scales that produce consistent results across different variables.

Since all methods outlined above have their own specific advantages and disadvantages, it can be concluded with the CoE (2001: 207) that

Methodological triangulation in rating scale development has the potential to compensate for some of the intrinsic weaknesses of any one scale construction method, produce less biased rating instruments, and ultimately increase the validity of the inferences drawn from the scores. One such multiple-method approach to developing rating scales is described in Galaczi et al. (2011), who report on how the findings from a range of sources, including consultation with experts as well as qualitative and quantitative studies, informed a rating scale development process in the context of Cambridge ESOL examinations. They present the activities carried out as part of the scale development process in the form of a two-dimensional grid, according to which the development process takes place along a chronological and a methodological dimension. While the former encompasses the developmental stages of establishing design principles, a componential analysis of individual aspects of the scale in isolation, and an operational analysis of the scale as a whole, the latter refers to the methodological steps taken in each stage, including consultation, development, and research. In the first stage, i.e. setting out the design principles, for example, the Cambridge ESOL assessment practice current at that time was reviewed in the light of the literature, decisions as to assessment criteria and the number of score bands were taken, and new descriptors were drafted on the basis of conversation analysis, reflecting the methodological steps of consultation, development, and research, respectively. In the next stage, the new descriptors were related to the CEFR levels, using a multi-faceted Rasch research design to obtain quantitative evidence and to identify problematic descriptors wanting further attention. Moreover, think-aloud protocols of the assessment process were created to gather qualitative evidence about the functioning of the rating scales.
The operational analysis, finally, comprised two multiple-marking trials employing multi-facet Rasch measurement with a view to investigating the rating scales from a statistical perspective and comparing the assessments with those produced by the previous versions of the scales. Like any theoretical model, the one proposed by Galaczi et al. (2011: 223) has its shortcomings. Firstly, it fails to take account of the cyclical and iterative nature of the scale development process. Although Galaczi et al. (2011: 224) point out that the draft descriptors were “reviewed and refined in an iterative process”, their model itself depicts the process as linear. Secondly, as the model was designed expressly for large-scale speaking tests, it cannot be readily applied to smaller scale construction projects. That said, it offers a systematic approach to scale development and validation and a descriptive framework that spells out the development stages of the process. It makes a strong case for a priori validation and a systematic combination of intuitive and data-driven scale construction methodologies. Systematic methodological triangulation was in fact the guiding principle behind the development of the ELTT scales. Intuitive, theory-based, and empirical methods were combined in a cyclical and iterative scale construction process. How this triangulation was achieved is described in chapter five below.

3.4  Controversy over rating scales

Even if rating scales are the product of a combination of different approaches, they can be criticised on several grounds. Most criticisms of currently used rating scales are related to either the application of the scale in practice or the nature of the scale itself. As regards the former, one of the main concerns is that scales cannot be used in a reliable way. That is, users of the scale cannot interpret and apply the scale descriptors consistently because the rating scale may not allow a common interpretation. Although rater training has the potential to reduce rater variation (Barnwell 1989; Knoch 2011a; Weigle 1994), rating scales will allow different perceptions of the construct they operationalise (McNamara 1996: 126). A limited number of studies, mainly in the context of assessing writing, have investigated the practical problems rating scales create for raters. Smith (2000), for example, analysed think-aloud reports and found that raters had difficulty interpreting and applying some of the scale descriptors. Similarly, Shaw (2002) points out that although most examiners consider the level descriptors to be clearly worded and easily interpretable, one third of them do encounter difficulties applying the descriptors to performances. Brown (2000) used verbal recall protocols to investigate rater variability and found that the disagreement was caused to some extent by individual priorities that raters had given to different features of the performance. While some raters focused more on syntax, for example, others favoured discourse. That is, the rating scale, although holistic in orientation, allowed raters to consider aspects of the performance differentially and produce some idiosyncratic ratings. Analysing think-aloud data, Lumley (2002) noted that although raters follow a generally similar rating process, the relationship between the scale and the text remains obscure. 
Since rating scales do not cover all eventualities, raters are forced to develop their own strategies when trying to reconcile performance features with the wording of the rating scale descriptors.

As regards the nature of rating scales, most criticisms are directly or indirectly related to the approach taken by the scale designers to create the scales, and pertain to the underlying concept of communicative competence, language use, and/or the hierarchy of language abilities at various scale levels. As mentioned at several points above, intuitively developed rating scales have been criticised for their a priori nature (Brindley 1991; Chalhoub-Deville 1995; Fulcher 1993, 1996; North 1995; Upshur & Turner 1995). That is, the scales are produced pragmatically by experts, with the criteria being selected on the basis of expert judgements and in line with the local teaching culture. The validity of such scales is typically proclaimed by the authority of the scale developers or users rather than supported by theory or empirical evidence. Such approaches may generate scales which are atheoretical (Lantolf & Frawley 1985), which group features together that may not co-occur in real speech (Turner & Upshur 2002), which include features that do not occur in actual performance at all (Brindley 1998; North 1995; Turner & Upshur 2002), or which do not comply with SLA research (North 1993). Furthermore, the descriptor formulations of intuitively developed rating scales are often said to be imprecise (Alderson 1991), subjective (Mickan 2003), interdependent, and thus not criterion-referenced (Turner & Upshur 2002).

One of the main concerns about the nature of rating scales based on expert judgement is that such scales may fail to justify implicit claims concerning the progression of language proficiency. As mentioned above, analytic scales contain a number of criteria, each of which has descriptors at the different levels of the scale. The implication is that the rating scale describes dimensions along which language proficiency grows. Different levels of the scale are asserted to correspond to different degrees of difficulty. It seems logical to assume then that good test takers who have the abilities described high up on the scale will also have the abilities described at the bottom end of the scale. Conversely, however, test takers who are only able to do the activities described at the lower levels will not be able to do those higher up on the scale. Difficult language activities will contain features that candidates at lower levels have not mastered yet. More proficient candidates, by contrast, master these features. The assumption then is that to improve their language proficiency, candidates need to go through developmental stages in which they acquire or learn the features characteristic of more difficult activities. In other words, a candidate’s proficiency will have to grow along the path from the ability to do easy activities to the ability to do difficult activities. Proficiency thus develops gradually until it reaches the degree required to perform the more difficult activities as well. It develops along a dimension of growth, and the rating scale is a description or operationalisation of this dimension of growth.

However, it is far from clear whether, and if so to what extent, intuitively developed rating scales in fact describe an implicational continuum of increasing language proficiency. A definition of scales by Clark as “descriptions of expected outcomes, or impressionistic etchings of what proficiency might look like as one moves through hypothetical points or levels on a developmental continuum” captures this major weakness (1985, quoted in North & Schneider 1998: 219 [emphasis added]). Lantolf and Frawley (1985, 1988, 1992) were among the first to criticise this weakness. They expressed doubts about the analytic logic of behavioural rating scales such as the ACTFL and the Educational Testing Service (ETS) Guidelines by saying that they “are complicated ways of proving, propagating, and imposing analytic truths with reference to the model, and then masking these claims as empirical truths which reify pseudo-observation as fact” (Lantolf & Frawley 1985: 339). The definitions of the Guidelines were criticised as being reductive and self-proclaimed truths. They seem to prescribe what test developers think candidates should be able to do rather than describe what they actually do. Since there is no empirical evidence for the hierarchy of linguistic aspects, the ACTFL/ETS Guidelines seem to measure reality by definition, imposing analytic truths, and then passing off these claims as empirical. Similarly, Pienemann et al. (1988) criticised the notion of a universal pattern of language development intrinsic to such rating scales, emphasising that the notion of development must be grounded in theory and supported by empirical evidence.
Not only do rating scales assume a linear incremental advancement, failing to take adequate account of phenomena commonly observed in SLA, such as backsliding and differential abilities in different discourse domains, but they also disregard evidence from SLA research that attributes variability to a number of factors that are not normally part of the test construct, including, for example, psycho-sociological orientation (Meisel et al. 1981), emotional investment (Eisenstein & Starbuck 1989), planning time (Ellis 1987), and the status of the interlocutor (Beebe & Zuengler 1983).

In its strongest form, this kind of criticism has led to the rejection of any form of ascending levels of proficiency and what has come to be known as the ‘Proficiency Movement’ in general (Lantolf & Frawley 1985, 1988, 1992). In its weaker form, this criticism has led to the call for more extensive research (McNamara 1996) and the combination of intuitive methods with empirical ones (Fulcher 1996). Accordingly, Brindley (1998) highlights the need for research to show that the stages of proficiency development described in rating scales correspond to the reality of language acquisition and use. Similarly, Pienemann et al. (1988) note that rating scales will remain susceptible to this sort of criticism unless research can demonstrate that there is a relationship between what is described in the rating scale, the model of acquisition implied, and what is actually observed in real performances. Therefore, it would be useful to have an empirical description of the dimension along which speaking proficiency grows. In other words, compiling a rating scale does not suffice; it also needs to be subjected to a process of validation. Research needs to show that the stages of proficiency described in rating scales correspond to the reality of language learning and use. Investigating the hierarchy of the proficiency descriptions featuring in the ELTT scales is indeed one of the main objectives of this study. Although the scales have been generated in a methodologically triangulated construction process, and a large part of the scale descriptors have been modelled on the empirically validated CEFR descriptors, it is by no means clear that the progression corresponds to real language use. To the extent that this study aims to investigate the validity of the hierarchy of the scale descriptors, it can be considered a validation study.
The following chapter focuses on rating scale validation, including a few general remarks about the concept of validity, how it has changed over time, and a brief overview of some seminal rating scale validation research. It also introduces the specific aspect of validity that this study investigated.

4   Rating scale validation

4.1   Validity and validity evidence

The notion of validity and the corresponding validation processes have changed over time. While in the 1960s validity was seen as given if the test measured what it was supposed to measure and produced reliable test results across administrations, in the 1970s validity was discussed in relation to a number of test features that were assumed to contribute to validity. Accordingly, different types of validity were distinguished, including the classical trio of ‘criterion-oriented’, ‘content’ and ‘construct validity’. Each type of validity is related to the type of evidence presented to show that a test is valid. While criterion-oriented validity aims to demonstrate the relationship between a test and a criterion, either in the form of another test deemed valid or the performance on some future criterion, content validity is given if the content of a test is characteristic of the domain that is being tested. Construct validity indicates to what extent a test is representative of an underlying theory of language use and to what extent that theory can explain or predict the performance on a test. All these aspects were usually investigated separately and in isolation. It was not until Messick’s (1989) seminal article on validity that the general perspective changed from a divided to a unitary view of validity, which also takes account of the consequences of a test. He defines validity as

an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment (Messick 1989: 13).


Publication date
2015 (December)
Frankfurt am Main, Berlin, Bern, Bruxelles, New York, Oxford, Wien, 2015. 395 pp., 39 tables

Biographical notes

Armin Berger (Author)

Armin Berger is a Senior Lecturer in English as a Foreign Language in the English Department at the University of Vienna. His main research interests are in the areas of teaching and assessing speaking, rater behaviour, language assessment literacy, and foreign language teacher education.

