Developing C-tests for estimating proficiency in foreign language research

by John Norris (Volume editor)
©2018 Edited Collection 312 Pages
Series: Language Testing and Evaluation, Volume 39


This book explores the development of C-tests for providing efficient measures of foreign language proficiency in eight different languages: Arabic, Bangla, Japanese, Korean, Turkish, French, Portuguese, and Spanish. Researchers report on how C-test principles were applied in creating the new language tests, with careful attention to language-specific challenges and solutions. The final, five-text C-tests in all languages demonstrated impressive psychometric qualities as well as strong relationships with criterion variables such as learner self-assessments and instructional levels. These test development projects provide new tests for use by foreign language researchers, and they demonstrate innovative and rigorous test development practices in diverse languages.

Table Of Contents

  • Cover
  • Title
  • Copyright
  • About the author(s)/editor(s)
  • About the book
  • This eBook can be cited
  • Table of Contents
  • Chapter 1: Developing and investigating C-tests in eight languages: Measuring proficiency for research purposes (John M. Norris)
  • Chapter 2: Design and development of an Arabic C-test (Michael Raish)
  • Chapter 3: Developing a C-test for Bangla (Todd McKay / Nandini Abedin)
  • Chapter 4: Development and validation of a Japanese C-test (Shoko Sasayama)
  • Chapter 5: A Korean C-test for university language learners (Young A Son / Amy I. Kim / Eunyoung Cho / John McE. Davis)
  • Chapter 6: Designing a C-test for foreign language learners of Turkish (Merve Demiralp)
  • Chapter 7: The C-test in French: Development and validation of a language proficiency test for research purposes (Corinne L. Counsell)
  • Chapter 8: The development of a Portuguese C-test for research purposes (Luciane L. Maimone)
  • Chapter 9: A computer-administered C-test in Spanish (Daniel Riggs / Luciane L. Maimone)
  • Chapter 10: Developing C-tests across eight languages: Discussion (Carsten Roever)
  • About the Authors
  • Series index


John M. Norris

Educational Testing Service

Chapter 1: Developing and investigating C-tests in eight languages: Measuring proficiency for research purposes

Abstract: A major challenge to measuring and reporting proficiency in foreign language research is the lack of suitably efficient and effective instruments in numerous languages. The projects reported in this volume explored the development of C-tests for providing quick estimates of research participants’ global proficiency levels in a variety of less- and more-commonly taught languages. C-tests require examinees to supply the missing parts of words that have been deleted within multiple short texts, and they have been used successfully for making broad proficiency distinctions primarily in educational decision making. Here, the focus is on the development of C-tests, for use as proficiency measures within L2 research studies, in eight languages: Arabic, Bangla, French, Japanese, Korean, Portuguese, Spanish, and Turkish. Development of each test followed parallel procedures for (a) identifying texts at varying levels of difficulty, (b) selecting a language-specific approach to word deletion, (c) piloting the texts with first and second language speakers, and (d) evaluating and selecting a final set of operational texts. This chapter introduces the purpose and goals of the C-test development project in general, provides background on the C-test format and principles, outlines the specific steps taken in all of the projects, highlights some of the unique contributions of each chapter, and offers suggested next steps in developing and investigating C-tests for foreign language research.

Introduction: Measuring proficiency in L2 research

Measuring and reporting the second language (L2) proficiency of research participants is essential for the systematic, replicable, and interpretable study of language learning, language instruction, and related factors. The overall or ‘global’ second language proficiency of learners is a key covariate in most if not all L2 research, and it is likely that proficiency moderates everything from how learners respond to grammaticality judgments or affective questionnaires to how they perform on communication, assessment, and learning tasks. It is absolutely essential, then, to determine and report learners’ proficiency levels such that accurate interpretations can be made about the learners themselves and the language-related topics under investigation. According to Norris and Ortega (2012), three primary reasons for persistently measuring and reporting language proficiency in L2 research include:

1. To sample participants into a study or to assign participants to groups;

2. To indicate the extent to which study findings can be generalized to other studies or contexts with distinct learner samples and populations; and

3. To pay attention to the likely major ‘lurking’ effects of proficiency as a moderating variable.

For just one example of how learner proficiency can make a big difference in study findings, consider Kim (2009). In this study, Kim wanted to find out whether a particular task design had any effect on the extent to which learners engaged in learning-related behaviors during communicative interactions towards completing the task. The learning-related behaviors had to do with so-called “language-related episodes”, that is, points within the interaction where the learners actually focused on their own or each other’s language per se (e.g., by helping to identify a word in English, or by explaining the correct grammatical form to use). Kim’s learners engaged in both complex and simple versions of two different tasks (with complexity manipulated by adding reasoning demands or the number of elements to deal with in the task). Luckily, Kim also considered her learners’ English language proficiency and grouped them into distinct levels, based on two sources of evidence: (a) their enrollment level in an intensive English instructional program, and (b) their scores on the paper-based Test of English as a Foreign Language (TOEFL). Her findings showed dramatically different effects of the task designs on learning-related behaviors, depending on the English proficiency levels of the learners. For the reasoning demands task, the complex version of the task showed a very large negative effect on low-proficiency learners (i.e., they demonstrated many more learning-related behaviors on the simple version of the task than on the complex version), yet there was a relatively strong positive effect on high-proficiency learners (i.e., they actually benefited from the more complex version of the task, as it led them to engage in more learning-related behaviors). Similarly, an interaction between learner proficiency and simple versus complex task design was found for the second task as well.
Had Kim not measured and included English L2 proficiency as a potential moderating effect in this study, her findings might have indicated on average that there were no effects attributable to task design, yet it was apparent that task design had substantial and different effects, depending on the proficiency levels of the participants. Clearly, measuring, considering, and reporting L2 proficiency can make a big difference in what we interpret about language research.

However, as reviews over the past several decades have indicated, the global proficiency of learner participants is at best inconsistently measured or reported in most domains of L2 research, never mind investigated as a likely moderating variable (e.g., Norris & Ortega, 2000; Thomas, 1994, 2006; Tremblay, 2011). As a result, it is unclear to what extent proficiency is intervening in the accurate interpretation of research findings, both within and across studies. Returning to the example above, Kim’s study belongs to a domain of investigation typically referred to as “task complexity” research, of which there have been many hundreds of studies conducted in many different languages since the early 1990s. In a forthcoming research synthesis of this domain, Sasayama, Malicka, and Norris (in press) coded study reports for whether, and the ways in which, language proficiency was determined and reported. They found that some 24% of the studies reported participants’ L2 proficiency according to a measure that could be interpreted outside of the local study context (e.g., test scores on a commonly available proficiency assessment), while an additional 29% reported some kind of local proficiency estimate (e.g., instructional level within a language program) that could not be generalized beyond the study context. Some 17% of studies reported completely uninterpretable proficiency judgments like “intermediate learners of English”. They also found that only approximately 19% of the studies looked at L2 proficiency as a potential moderating variable (i.e., a variable that might affect findings, in the way observed in Kim’s study above). 
These findings are indicative of the considerable challenge confronting L2 research: it is clear on the one hand that learner proficiency can and does make a difference in how participants perform in L2 research; on the other hand, proficiency does not receive persistent or consistent attention, neither as a basis for understanding the learner populations sampled within a given study nor as a likely moderating variable that affects study findings.

A major challenge to measuring proficiency within L2 studies is the lack of available instruments that can be implemented in efficient and cost-effective ways and that apply across diverse research populations and learning contexts. Having participants complete a commercially available standardized proficiency assessment may impose a large time commitment (i.e., for study participants to complete the assessment) and a heavy financial burden on language researchers, rendering such proficiency assessments only marginally feasible for most research contexts. Another major limitation for researchers working with languages outside of English, and a few others that are more commonly taught, is the reality that available proficiency assessments in some languages are few and far between. Arabic is an interesting case in point, where published research related to Arabic language assessment is quite rare, and where nearly all commercial assessments measure proficiency according to a single, specific notion of the construct as reflected in the American Council on the Teaching of Foreign Languages Proficiency Guidelines (see discussion in Norris & Raish, 2017).

One possible solution to this L2 proficiency measurement challenge is the development of new, efficient, low-cost (or no-cost) assessments for the specific purpose of informing language research. Where such assessments can be created and made available to specific language research communities, their subsequent replication and evaluation across study contexts and participants might enable a basis for resolving at least some of the problems outlined above. The remainder of this chapter introduces one effort to launch just such a process across multiple languages.

Background and goals for the projects

The projects described in this volume were initiated as one relatively coherent effort to begin responding to the critical need for proficiency measurement in foreign language research. Serendipitously, the newly founded Assessment and Evaluation Language Resource Center (AELRC) had just been launched in 2014 at Georgetown University as a federally funded initiative to encourage “useful program evaluation and assessment in support of foreign language teaching and learning” (https://aelrc.georgetown.edu/). Support from the AELRC was secured at that time in order to develop and evaluate ‘short-cut’ assessments of proficiency in a variety of (primarily less-commonly-taught) foreign languages. The original goals of the project were to:

  • develop proficiency measures in multiple languages;
  • evaluate the measures for gauging global proficiency levels within specific learner populations; and
  • make resulting instruments/procedures freely available to the foreign language research community, thereby encouraging their use.

The decision to focus on ‘short-cut’ assessments came in direct response to the challenges of proficiency measurement highlighted above. Namely, for many languages, proficiency assessments simply do not exist or are not available for use in research. More to the point, assessments that are available often require considerable administration time—which researchers and participants seldom have—and they may come with relatively high price tags (e.g., large-scale standardized assessments designed for purposes other than research, such as university admissions tests). For these reasons, the main objective of the current projects was to demonstrate how maximally efficient measures of L2 proficiency might be developed, and to make the products freely available for subsequent use. So-called ‘short-cut’ assessments lent themselves nicely to the parameters of this work.

By ‘short-cut’ was meant the variety of language assessments that can, within relatively few items and short test-administration time, provide reliable and accurate estimations of holistic language proficiency across a broad range of levels. Such assessments included the classic reduced redundancy test types, which according to Spolsky (1969) were intended “…to test a subject’s ability to function with a second language when noise is added or when portions of a test are masked” (p. 10). Possibilities here were cloze, dictation, noise, and similar test types. In addition, more recent assessment innovations showed promise in providing robust capacities for distinguishing very widely differing learner proficiency levels while maintaining efficiency, including in particular the elicited imitation and C-test assessment approaches.

Among these ‘short-cut’ options, several have received attention specifically as solutions for filling the foreign language proficiency measurement gap identified above. For example, Tremblay (2011) reported on development of a French cloze test, and Lee-Ellis (2009) on a new Korean C-test, both for the purposes of controlling/accounting for the proficiency levels of learner population samples in their studies. Similarly, elicited imitation tests—where learners repeat a series of sentences presented to them aurally—have been developed in a host of different languages recently for the express purpose of gauging research participants’ proficiency (e.g., in French, Tracy-Ventura et al., 2013; in Chinese, Wu & Ortega, 2013; and in Russian, Drackert, 2015). Of interest for the current work, findings from these and related studies have persistently shown that these assessment types correlate well with criterion measures of global or holistic language proficiency.

For the current project, then, several parameters were set for determining the most likely ‘short-cut’ assessment approaches to be adopted. Assessments would need to be maximally efficient, such that a wide range of levels could be differentiated quickly. They would need to be portable, in the sense that both paper-based and computer-delivered formats were likely desirable, and administration in both modalities would need to be relatively uncomplicated. They would also need to be viable across different languages with quite different phonological and orthographic systems. Finally, they would need to demonstrate appropriate psychometric qualities (e.g., high reliability) while differentiating consistently across quite broad proficiency levels (e.g., ranging from Novice-High to Superior on the American Council on the Teaching of Foreign Languages Proficiency Guidelines; see Swender, Conrad, & Vicars, 2012).

On review, the C-test and elicited imitation test approaches were identified as most likely to meet these criteria, due primarily to their demonstrable efficiency as well as apparent ease of development, administration, and scoring. Both of these test formats also arguably tap into integrated language ability in that they both call to some degree on receptive (processing of written or aural input) as well as productive skills (writing words or speaking sentences). In the first phase of work, reported in this volume, the C-test was selected for development and investigation first, largely because it was deemed to be the more challenging proof of concept endeavor (as described in the next section). Ultimately, though, there may be good reason to pursue a combination of both C-test and elicited imitation formats, given that one takes place entirely in the written mode while the other is fully aural, thereby providing an interesting possibility for identifying distinct learner L2 proficiency profiles.

The current C-test projects were thus launched between 2014 and 2015 under the general guidance of this chapter’s author, and with financial support of the AELRC. Test developers/researchers were graduate students and language teachers who were either first language speakers or very advanced second language speakers of the languages targeted for test development. The team of researchers met regularly and exchanged ideas about C-test development, with the goal of agreeing upon a general set of practices that might guide their (and future) test development efforts. Prior to describing these practices, though, an introduction to the C-test approach provides essential conceptual and empirical background, in the following section.

C-tests: Origins, constructs, uses

The C-test was developed in the early 1980s as an efficient and potentially more effective alternative to the cloze procedure for estimating global language proficiency in the written modality. Introduced by Raatz and Klein-Braley (1981), the C-test approach features deletion of the second half of every second word within a paragraph-length reading passage, and examinees are required to reconstruct the original words to create a coherent text. An example C-test text in English might look like the following.

The potential advantages over the cloze procedure (where entire words are deleted, typically every seventh or ninth) are several, chief among which is the fact that many more items can be completed within a much shorter reading space and amount of time. By compiling several texts (typically 3–5), quite a large number (e.g., 75–125) of items can be tested in a single, brief sitting (around half an hour). C-tests also tap into a greater number and variety of language features than cloze, as examinees try to reconstruct accurate responses. For example, the passage above requires knowledge of at least vocabulary, syntax, morphology, and spelling, all of which also depend on overall reading comprehension.
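
As a minimal sketch of the two deletion rules just described (and not the procedure used by any of the test developers in this volume), the following Python functions blank the second half of every second word for a C-test and every nth whole word for a cloze passage. The function names, the treatment of odd-length words (the shorter first part is kept), and the skipping of one-letter words are assumptions made for illustration only. Operational C-tests involve further decisions that this sketch ignores, such as leaving lead-in text intact and adapting the deletion approach to the language in question.

```python
import re

# Runs of letters count as words; digits and punctuation are left untouched.
WORD = re.compile(r"[^\W\d_]+")


def make_c_test(text):
    """Blank the second half of every second word.

    Assumptions (not taken from the source): one-letter words are skipped
    and not counted; for odd-length words the shorter first part is kept.
    Returns the mutilated text and the list of deleted word endings.
    """
    answers = []
    count = 0  # number of eligible words seen so far

    def mutilate(match):
        nonlocal count
        word = match.group(0)
        if len(word) < 2:
            return word  # too short to split
        count += 1
        if count % 2 == 1:
            return word  # first, third, fifth... eligible word stays intact
        keep = len(word) // 2
        answers.append(word[keep:])
        return word[:keep] + "_" * (len(word) - keep)

    return WORD.sub(mutilate, text), answers


def make_cloze(text, n=7):
    """Blank every nth whole word, as in a fixed-ratio cloze passage."""
    answers = []
    count = 0

    def blank(match):
        nonlocal count
        count += 1
        if count % n:
            return match.group(0)
        answers.append(match.group(0))
        return "_" * len(match.group(0))

    return WORD.sub(blank, text), answers


sentence = "The quick brown fox jumps over the lazy dog"
c_text, c_key = make_c_test(sentence)
cloze_text, cloze_key = make_cloze(sentence, n=7)
print(c_text)      # The qu___ brown f__ jumps ov__ the la__ dog
print(cloze_text)  # The quick brown fox jumps over ___ lazy dog
```

On this nine-word sentence the C-test rule produces four items while a seventh-word cloze produces one, which illustrates the item-density advantage noted above.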

What exactly gets measured by a C-test has been a matter of considerable debate and some investigation since its inception (e.g., Klein-Braley, 1985, 1996; Sigott, 2006). The original intended construct interpretation for the approach was purposefully general (Klein-Braley, 1997). As with other reduced redundancy tests (e.g., cloze, dictation) the idea was to reveal learners’ global or integrated proficiency in the language by confronting them with the challenge of reconstructing relatively large amounts (i.e., beyond a word or sentence) of randomly selected authentic communication. Given that first language speakers were able to do so with very high accuracy rates (e.g., Grotjahn, Klein-Braley, & Raatz, 2002), the C-test offered a means for gauging second language learners’ approximation of full competence in the target language. Over the years, researchers have probed more deeply into some of the specific language competencies that are called upon in completing C-test items and texts of different kinds, including vocabulary knowledge, reading comprehension, and others (e.g., Chapelle, 1994; Karimi, 2011; Grotjahn & Tönshoff, 1992). One upshot of this research has been the finding that, while C-test scores often correlate moderately well with measures of other discrete components of language knowledge, the C-test is not an equivalent measure of any one linguistic phenomenon. Rather, because the deletion strategy affects different types of words at different points in a sentence and across a text, multiple phenomena are affected and multiple types of language knowledge must be brought into play interactively to solve the reconstruction puzzle presented by the text (Babaii & Fatahi-Majd, 2014; Hastings, 2002; Sigott, 2004).
Interestingly, research has also suggested that learners at different proficiency levels draw upon distinct strategies and competencies for solving C-test deletions (e.g., Janebi Enayat & Babaii, 2017; Kontra & Kormos, 2006; Stemmer, 1991), which of course is precisely how learners at different proficiency levels confront the challenge of trying to communicate in a new language based on the different linguistic resources at their disposal.

Other validity research has looked more broadly at the relationship between C-test performance and global indicators of language proficiency. For example, Eckes and Grotjahn (2006a) utilized Rasch Model measurement and confirmatory factor analysis to demonstrate convincingly that a German C-test provided a unidimensional measure of language proficiency that was the same as that tested by the four-skills sections of the Test of German as a Foreign Language, a large-scale standardized proficiency assessment. In another example, Norris (2006, 2008) presented both longitudinal and cross-sectional findings that demonstrated substantial, linear, and predictable growth in German C-test scores by learners advancing through multiple years of instruction in a U.S. university language program. These and other studies (e.g., Coleman, 1994; Klein-Braley, 1985; Sigott, 2006) provide support for the original interpretation of the C-test construct, that is, as an estimation of global language proficiency. Of course, there are important caveats to this construct interpretation that bear emphasis. Most obviously, some learners may develop language proficiency in a particular modality (e.g., oral/aural proficiency) but not in the form of literate communication ability, and it is important to keep these populations of learners in mind when using the C-test. Indeed, the intended use of the C-test really should be the key point of departure in making any validity claims, as Grotjahn (1996) emphasized: “There is no once and for all, fixed construct validity for the C-test. Rather […] the construct validity of each individual test must be separately demonstrated for each specific intended use with each population” (p. 96).


Publication date: 2018 (May)
Keywords: Language testing, Test design, Language education, Measurement, Innovation, Applied linguistics
Berlin, Bern, Bruxelles, New York, Oxford, Warszawa, Wien, 2018. 312 pp., 25 fig. b/w, 49 tables

Biographical notes

John Norris (Volume editor)

John M. Norris is Senior Research Director at the Educational Testing Service, in the USA. He holds a Ph.D. in Second Language Acquisition. He has worked as a professor at the University of Hawaii at Manoa and at Georgetown University, and as an assessment specialist at Northern Arizona University.

