Validating Language Proficiency Assessments in Second Language Acquisition Research

Applying an Argument-Based Approach

by Anastasia Drackert (Author)
©2016 · Thesis · 239 pages
Series: Language Testing and Evaluation, Volume 38


The book introduces the reader to an argument-based approach to validity as a way to improve test validation in Second Language Acquisition (SLA) research. Motivated by the need for practical suggestions for raising proficiency assessment standards in SLA research, it exemplifies the approach by validating two distinct score interpretations for a new Russian Elicited Imitation Test (EIT). Two empirical investigations with 164 Russian learners in the USA and Germany were conducted to evaluate the accuracy of the score interpretations associated with two distinct test uses. The EIT proved to be a reliable and valid instrument for differentiating among a wide range of oracy skills, and the proposed cut scores enabled prediction of several levels of speaking and listening proficiency. The author concludes with implications for using the argument-based approach to validate assessments in SLA research, for the use of the developed Russian EIT, and for future research on Elicited Imitation Tests in general.

Table Of Contents

  • Cover
  • Title
  • Copyright
  • About the Author
  • About the Book
  • This eBook can be cited
  • Table of Contents
  • Chapter 1: Introduction
  • 1.1 Assessment of second language (L2) proficiency in SLA research
  • 1.2 Review of proficiency assessment in SLA research
  • 1.3 The challenge of L2 proficiency assessment in SLA research
  • 1.4 Responding to the challenge: The current volume
  • 1.5 Outline of the book
  • Chapter 2: Second Language Proficiency
  • 2.1 Models of language proficiency and competence: Conceptualizations from educational assessment
  • 2.2 Development of L2 proficiency
  • 2.2.1 L2 Proficiency levels
  • 2.2.2 Complexity, accuracy, and fluency (CAF) measures
  • 2.3 Language proficiency: Psycholinguistic conceptualizations
  • 2.3.1 Levelt’s (1989) model of language production
  • 2.3.2 Hulstijn’s (2007, 2011, 2015) model of L2 proficiency
  • 2.4 L2 Proficiency: Bridging the gap between disciplines
  • 2.5 Summary: L2 proficiency construct in the publication
  • Chapter 3: Elicited Imitation
  • 3.1 How does the Elicited Imitation Test (EIT) work?
  • 3.2 Concerns about the Elicited Imitation format
  • 3.3 Types of Elicited Imitation Tests
  • 3.4 Scoring procedures
  • 3.5 Sources of difficulty
  • 3.6 Overview of EIT validation studies
  • 3.6.1 EIT as a measure of communicative competence
  • 3.6.2 EIT as a measure of implicit knowledge of particular structures
  • 3.6.3 EIT as a measure of global oral language proficiency
  • EIT as a measure of global language proficiency
  • EIT as a measure of L2 oral proficiency
  • 3.7 Summary
  • Chapter 4: Validity Evaluation
  • 4.1 Earlier conceptualizations of validity (Trinity model)
  • 4.1.1 Criterion model
  • 4.1.2 Content model
  • 4.1.3 Construct model
  • 4.2 Validity in new Standards (1999)
  • 4.2.1 Evidence based on test content
  • 4.2.2 Evidence based on response processes
  • 4.2.3 Evidence based on internal structure
  • 4.2.4 Evidence based on relations to other variables
  • 4.2.5 Evidence based on the consequences of testing
  • 4.3 Argument-based approach to validity
  • 4.3.1 Interpretive argument
  • Scoring
  • Generalization
  • Extrapolation
  • Decision/utilization
  • 4.3.2 Validity Argument
  • Scoring
  • Generalization
  • Extrapolation
  • Decision/Utilization
  • 4.3.3 The missing chain in the argument-based approach
  • 4.4 Validity evaluation in the SLA field
  • 4.5 Summary
  • Chapter 5: Validation Study 1
  • 5.1 Test use and context
  • 5.2 The developmental stage: Creating the test and the interpretive argument
  • 5.2.1 Test development
  • Instructions
  • Timing parameters
  • Scoring
  • 5.2.2 Interpretive argument
  • 5.2.3 Evaluation of inferences and assumptions during test development
  • 5.3 Appraisal stage: Challenging the interpretive argument
  • 5.3.1 Evaluation questions
  • 5.3.2 Method
  • Participants
  • Instruments
  • Background Questionnaire
  • Elicited Imitation Test
  • Procedures
  • Analyses
  • 5.3.3 Results
  • Compiling the final combination of EIT items (EQ 1)
  • Functioning of the scoring rubric (EQ 2)
  • Generalizability of the EIT scores (EQ 3)
  • Correlations between the EIT and Russian learning history (EQ 4)
  • Correlations between the EIT and learners’ self-assessment (EQ 5)
  • 5.4 Discussion
  • Chapter 6: Validation Study 2
  • 6.1 Test use and context
  • 6.2 The developmental stage: Creating the test and the interpretive argument
  • 6.2.1 Test development
  • 6.2.2 Interpretive argument
  • 6.3 Appraisal stage: Challenging the interpretive argument
  • 6.3.1 Evaluation questions
  • 6.3.2 Method
  • Participants
  • Instruments
  • Background Questionnaire
  • Elicited Imitation Test
  • Russian Speaking Test
  • Listening Comprehension Test
  • C-test
  • Procedures
  • Scoring and analyses
  • Scoring
  • Analyses
  • 6.3.3 Results
  • Reliability of the EIT (EQ 1)
  • Correlations between the EIT and the RST (EQ 2)
  • EIT predictive ability of speaking skills (EQ 3)
  • Accuracy of the EIT cut scores (EQ 4)
  • CAF measures across the EIT ability levels (EQ 5)
  • Correlations between the EIT and CAF measures (EQ 6)
  • Correlations between the EIT and the LCT (EQ 7)
  • EIT predictive ability of listening skills (EQ 8)
  • Correlations between the EIT and the Russian C-test (EQ 9)
  • 6.4 Discussion
  • Chapter 7: Conclusions, Limitations and Future Research
  • 7.1 Limitations
  • 7.2 Implications
  • 7.2.1 Implications for using the argument-based approach for validating SLA assessments
  • 7.2.2 Implications for the use of the Russian EIT
  • 7.2.3 Implications for future research on EITs
  • 7.3 Conclusions
  • Appendix A: Russian Elicited Imitation Test (k = 56)
  • Appendix B: Scoring Guidelines for the Russian EIT
  • Appendix C: Items Measurement Report (US and Germany, k = 56)
  • Appendix D: Items Measurement Report (US and Germany, k = 31)
  • Appendix E: DIF analysis
  • Appendix F: Russian Elicited Imitation Test (k = 31)
  • Appendix G: Improved Scoring Guidelines for the Russian EIT
  • Appendix H: Items Measurement Report (EIT, Study 2)
  • References
  • Series Index



1.1  Assessment of second language (L2) proficiency in SLA research

Assessment of second language proficiency is done for many purposes. In educational settings assessments are used to make decisions about students, to inform classroom teaching and learning, and to improve, ensure, and demonstrate the quality of an educational program (Norris, 2013). Ideally, for all of these educational purposes, the use of an assessment instrument leads to a concrete action that has an appropriate implication for the individual, school, or language program. In comparison to educational assessment, second language acquisition (SLA) as a scholarly field employs assessments as instruments for collecting data with the goal of answering research questions about the linguistic, cognitive, social, or educational “factors that are hypothesized either to enable or inhibit the rate, route, and ultimate attainment of L2 acquisition” (Norris & Ortega, 2012, p. 573). In other words, no direct actions are involved concerning the individuals who are assessed.

In SLA research L2 proficiency assessments are mainly used for three purposes. First, assessments are employed for selecting participants into a study. Here, the use of L2 proficiency assessments helps to justify the sampling of participants into a study or the assignment of participants to distinct groups (Norris & Ortega, 2012, p. 580). Second, during the analysis, depending on the research question, it may be necessary to include a measure of L2 proficiency as a covariate because proficiency can directly influence L2 learners’ performance on language-related experiments or interventions (Hulstijn, 2012; Norris & Ortega, 2012). Third, after the data are analyzed, researchers draw conclusions and arrive at interpretations about how certain factors influence L2 learning, development, or performance, and thereby construct the knowledge that is accumulated within the SLA field. Again, they arrive at these conclusions on the basis of some type of assessment – be it a task, a test, or any other tool that allows researchers to elicit, observe, and interpret indicators of L2 proficiency. In all cases, information about participants’ proficiency guides readers of research as they determine to what extent the findings can be generalized to other samples of language learning populations.

Even though, in SLA assessments, no decisions are made that would have a direct impact on the individual participants, the concept of a justifiable test remains relevant, since claims based on assessments that are not valid might lead to inaccurate experimental results and interpretations, or even to undesirable consequences beyond the SLA field. For example, Rosansky (1979), as cited in Shohamy (1994), showed how the Bilingual Syntax Measure tests used in early morpheme acquisition studies led to a new policy for limited-proficiency English speakers that was based on unreliable and invalid instruments. Shohamy (1994) suggested that if SLA researchers applied some of the procedures used by language testers in validating their assessments, they would be able to obtain data that are more reliable and valid. Thus, even though one does not make educational decisions or take actions about the individual participants in an empirical study, the question of validity remains important for SLA assessment, too.

1.2  Review of proficiency assessment in SLA research

Several scholars have conducted systematic investigations of L2 proficiency assessment instruments employed in SLA studies over the past several decades. Thomas (1994) investigated the conventions for assessment of target-language proficiency in empirical research on L2 acquisition between 1988 and 1992 in four key journals: Applied Linguistics, Language Learning, Second Language Research, and Studies in Second Language Acquisition. From her analysis of a corpus of 157 experimental and observational studies, she identified four major categories of proficiency assessment techniques: (1) impressionistic judgments (21 %), (2) institutional status (40.1 %), (3) in-house assessment (14 %), and (4) standardized tests (22.3 %); 2.5 % were classified as ‘others’.

Twelve years later Thomas (2006) repeated the study using the five volumes of the same journals published between 2000 and 2004. Having analyzed a corpus of 211 empirical papers, she found that 19 % of the assessments of L2 proficiency were done by impressionistic judgments, 33.2 % of the studies used institutional status, 19.4 % used in-house assessments, and 23.2 % used standardized tests. A total of 5.2 % of the proficiency assessments were done by other techniques. The main difference from the overview undertaken 12 years earlier was a 7 % decrease in the use of institutional status and a 5 % increase in in-house assessment techniques. She also found some differences in the methods used to assess L2 proficiency across journals; however, the continued reliance on impressionistic judgments and institutional status in more than 50 % of the cases indicated little progress toward carefully designed and validated techniques in SLA research.

A further investigation of how systematically language proficiency is assessed in SLA studies was conducted by Tremblay (2011). She surveyed studies published in three journals (Second Language Research, Studies in Second Language Acquisition, and French Language Studies) between 2000 and 2008. She found that a little more than one third of the studies (53/144) assessed L2 learners’ proficiency independently with some assessment tool. The most commonly used tests in these studies were one or more original or simplified sections of existing standardized proficiency or placement tests (e.g., Greek Language Proficiency Test, Michigan Test, Oxford Proficiency Test, Test of Adolescent and Adult Language, some section of the TOEFL iBT exam), a cloze test or a C-test, and oral interviews or accent ratings. Of the studies that did not administer a test, the majority estimated L2 proficiency on the basis of classroom level or years of instruction, followed by existing proficiency scores (typically TOEFL iBT scores) and length of residence in an environment where the target language is spoken. In addition, Tremblay (2011) noted that many studies did not include sufficient details about the proficiency tools used. In particular, no information was provided on which sections of the standardized test had been used, whether the instruments had been standardized, how the oral interviews had been conducted and rated, and how recent the existing proficiency scores were.

Hulstijn (2012) summarized similar problems when he reviewed how the construct of language proficiency is measured in the study of bilingualism from a cognitive perspective. He reviewed a corpus of 140 empirical papers published in volumes 1–14 (1998–2011) of the journal Bilingualism: Language and Cognition. In the analysis Hulstijn focused only on the definition and measurement of language proficiency (LP) and its implications for (a) the definition and operationalization of language dominance, and (b) the selection of native-speaker control groups. Hulstijn (2012) found that in slightly more than half of the papers in which researchers should have used language proficiency as an independent variable or a covariate, proficiency was not measured with an objective language proficiency test. Typical selection criteria were age, language-acquisition history and environment (e.g., age of arrival in the L2 country, length of residence in the L2 environment, years of L2 instruction), and self-assessed proficiency in one or several domains of language use (listening, speaking, reading, and writing). Further selection criteria included performance on a researcher-administered language test or possession of a language certificate. Hulstijn (2012) also found that researchers seldom used participants’ proficiency scores in explaining variance observed in the dependent variable(s). In conclusion he recommended that in studies investigating between-group contrasts, researchers should carefully consider the assessment of participants’ proficiency in the investigated languages, even in native-speaker (NS) comparison groups. In particular, the following suggestions were made:

        1.  Motivate the absence or presence of an objective measurement of LP to select candidate participants.

        2.  If an objective LP test is used, motivate the type of test chosen, given the age, literacy and educational level of participants. Describe the test’s target group (age, literacy and educational level), skills measured, task(s), and materials in sufficient detail, and report its validity (if known) and its psychometric characteristics (e.g., internal consistency) for the target group.

        3.  For the assessment of language dominance, (i) use tests of oral reception and production rather than tests of reading or writing, (ii) exclude linguistic elements that not all adult NSs may be familiar with, and (iii) administer the test to NSs too, in order to verify whether NSs of lower intellectual, educational or professional profiles perform at ceiling. If assessment of LP only takes place in the form of a questionnaire (to be filled out by the participants themselves or their parents or teachers), make its items refer to communicative skills in the realm of basic language cognition (BLC) only (reception and production of oral language involving only high-frequency elements), or in BLC and higher language cognition (HLC; unrestricted oral and written language use) separately.

        4.  Consider whether it is appropriate to analyse the data (in addition to, or instead of using ANOVA) with multilevel linear mixed modelling or similar techniques that allow the researcher to determine to what extent the LP-assessment data (within and across participant groups) account for the amount of variance observed in the dependent variable(s).
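The core of suggestion 4 is to quantify how much of the variance in the dependent variable the LP-assessment data account for. As a minimal sketch of that idea, the snippet below computes an ordinary R² for a simple linear fit of a dependent variable on LP scores; the scores and accuracy values are invented for illustration, and a real analysis would use multilevel linear mixed modelling (e.g., with participant-level grouping) rather than this single-level simplification.

```python
# Minimal sketch: proportion of variance in a dependent variable accounted
# for by participants' LP scores, via R^2 of a simple linear fit.
# All data below are hypothetical; Hulstijn's actual recommendation is
# multilevel linear mixed modelling, which this single-level fit only hints at.

def r_squared(x, y):
    """R^2 of an ordinary least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / syy

# Hypothetical LP test scores and a dependent variable (e.g., task accuracy)
lp_scores = [35, 42, 50, 58, 63, 71, 77, 85]
accuracy = [0.52, 0.55, 0.61, 0.64, 0.70, 0.74, 0.79, 0.86]

print(f"Variance accounted for by LP scores: R^2 = {r_squared(lp_scores, accuracy):.2f}")
```

A reported R² near zero would suggest that group membership, not measured proficiency, drives the observed differences, which is exactly the ambiguity Hulstijn's recommendation is meant to resolve.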


ISBN (Hardcover)
Publication date: 2015 (November)
Keywords: elicited imitation, oracy, score interpretations, language testing
Frankfurt am Main, Berlin, Bern, Bruxelles, New York, Oxford, Wien, 2015. 239 pp., 44 tables, 46 graphs

Biographical notes

Anastasia Drackert (Author)

Anastasia Drackert (née Mozgalina) holds a PhD in Linguistics with a specialization in Language Testing from Georgetown University. She works and teaches in the areas of language assessment, foreign language education, and task-based language learning and teaching. Her research has appeared in a variety of journal articles and book chapters.

