Language Proficiency Testing for Chinese as a Foreign Language

An Argument-Based Approach for Validating the Hanyu Shuiping Kaoshi (HSK)

by Florian Meyer (Author)
Thesis, 349 pages


1 Introduction

Since the Reform and Opening policy of the People’s Republic of China in 1978, the economic and political importance of China has grown enormously, and more and more individuals want or need to learn Chinese. Although reliable data on the worldwide number of learners of Chinese do not exist (Sūn Déjīn, 2009, p. 19), there is evidence of a strong increase. In South Korea, there are around 100,000 learners in schools and universities; together with those who study via TV, radio or other media, they exceed 1,000,000 (Niè Hóngyīng, 2007, p. 87). In Japan, Chinese has become the second most popular foreign language behind English, with 2,000,000 learners (Sū Jìng, 2009, p. 88). Europe still lags behind; however, in Germany more than 4,000 students learn Chinese in intensive language programs at universities and colleges (Bermann and Guder, 2010), while an unknown number study in optional classes. Together with learners at secondary schools, students of Chinese in Germany number 10,000, so that only France has more learners of Chinese in Europe (Fachverband Chinesisch, 2011). In the United States, nearly 2,000 high schools already offer Chinese, which has become the third most popular language behind English and Spanish (ibid.).

Figure 1: HSK test taker development (black: foreign group; gray: Chinese ethnic minorities).
For 2006, bars estimated on a total number of 160,000 (Yáng Chéngqīng and Zhāng Jìnjūn, 2007, p. 108); other data from Sūn Déjīn (2009, p. 20). Data rights shifted from the HSK Center to the Hanban5 in 2005.

In addition, more and more students participate in language proficiency tests for Chinese as a foreign language (CFL) (cf. Figure 1), and the number of tests has also risen. Since the beginning of the 1980s, well over twenty tests have been launched (cf. chapter 1.5). These tests fulfill different purposes, such as helping test takers enter a Chinese or Taiwanese university, placing students into appropriate language courses, giving credit points to students who have gained considerable knowledge prior to their studies, or helping companies find employees who can handle business communication and translation work. In South Korea, many job applicants are expected to be able to use Chinese due to the strong economic ties between China and Korea, and the level of proficiency is often directly related to salaries (Niè Hóngyīng, 2007, p. 87); there, the HSK (Hànyǔ Shuǐpíng Kǎoshì 汉语水平考试), the official Chinese proficiency test of the People’s Republic of China (PRC), can have a major impact on test takers’ lives, affecting “the candidates’ future lives substantially,” so the test can be considered a high-stakes test (Davies, Brown, Elder, Hill, Lumley, and McNamara, 1999, p. 185; Bachman and Palmer, 1996, pp. 96–97).

The HSK has the largest test population of all CFL tests, and it has prompted the most research. In 2007, more than 1,000,000 test takers participated in it (Wáng Jímín, 2007, p. 126). In Germany, the HSK was the only CFL proficiency test available until 2009, when the Taiwanese TOCFL (Test of Chinese as a Foreign Language, Huáyǔwén Nénglì Cèyàn 華語文能力測驗) entered Germany. In 2010, the “new HSK” replaced the former HSK version.6 However, as of 2013, some universities in China still offer the old HSK.

In fact, overhauling the old HSK was necessary because it had several major limitations: the HSK resembled the format of a discrete-point test7; it did not directly assess oral and written productive skills; in addition, the score and level system was not easy to comprehend (cf. Meyer, 2009), which made it difficult for stakeholders to interpret the meaning of HSK scores. On the other hand, the HSK had several advantages: it was a highly standardized test with very high objectivity and reliability, qualities that derived partly from the fact that the test almost exclusively used items in multiple-choice format. The test was intended to measure the Chinese language ability needed for successfully studying in China, and test takers’ results were set in relation to a norm-reference group. It was a high-stakes test for many Koreans, Japanese, Chinese ethnic minorities, and, in part, other foreigners interested in studying in China. The (old) HSK has now been used for more than 20 years, during which time it underwent changes, and some research on it is still being conducted. However, the major question is which inferences can be drawn from the test scores of test takers, especially those with a “Western” native language background, such as individuals from Germany. Therefore, this work will examine the quality of the (old) HSK, with the core question being whether HSK test scores, or rather their interpretation, can be considered valid. Is it a fair exam, or is it biased8 in favor of Japanese, Korean or other East Asian test takers9? What do HSK scores tell us about learners of Chinese?

1.1 An integrative validation of the old HSK

Although many HSK validation studies have already been conducted, this is the first work providing an integrative validation approach that attempts to incorporate all of them. But before starting this undertaking, one important fact should be stressed: there is no perfect test. As Cronbach ([1949] 1970) stated:

Different tests have different virtues; no one test in any field is “the best” for all purposes. No test maker can put into his test all desirable qualities. A design feature that improves the test in one respect generally sacrifices some other desirable quality. Some tests work with children but not with adults; some give precise measures but require much time; … Tests must be selected for the purpose and situation for which they are to be used. (ibid., p. 115; italics added)

Thus, this work examines whether the HSK is a valid test for a specific purpose.10 For what kind of use do the interpretations of HSK scores make sense? How can we interpret HSK scores and what inferences can we draw from HSK results? What is the intended use of the HSK, and what else should the HSK measure? In what sense are interpretations limited? What do the HSK and Chinese language testing research tell us about the quality of the HSK? What are the logical inferences leading from HSK test performance to conclusions about test takers? Which parts of the HSK consist of weak inferences that should be improved? And finally, what are the intended and unintended outcomes of using the HSK?

Another question concerns whether the HSK can be used as a diagnostic tool for the Chinese language acquisition process, especially for Western learners. Many Western learners did not consider (old) HSK “scores”11 a valid measure of their Chinese language competence, and they complained that the HSK had several shortcomings. First, the HSK did not assess productive oral skills. Second, Chinese characters were displayed in all sections and subtests (e.g., even in the multiple-choice answers of the listening subtest). And third, the HSK was mostly a multiple-choice test showing features of a discrete-point test, which did not replicate authentic language tasks.12 In contrast, HSK researchers claimed that the HSK “conforms to objective and real situations” (Liú Yīnglín, [1990] 1994, p. 1, preface).

This work shows that the old HSK provided valid score interpretations for assessing Chinese learners’ listening and reading abilities for the purpose of studying in China. Thus, one should consider the HSK’s specific purpose when evaluating its usefulness. The validation, or the evaluation of its usefulness, will be undertaken in chapter 4 based on HSK research. This validation study reveals weak aspects of the inferences drawn from the scores of HSK test takers. For instance, inferences about test takers’ productive skills are rather limited. Hence, one major goal of this study is to explain clearly which parts of the HSK should be strengthened to provide a better estimate of whether learners’ Chinese language abilities sufficed to study at a Chinese university. The validation approach used in this dissertation is an argument-based approach (Kane, 1990, 1992, 2006), which has been used successfully in recent years and was adopted to develop the new Test of English as a Foreign Language™ (TOEFL®), the TOEFL iBT (Chapelle, Enright, and Jamieson, 2008).

In chapter 5, the HSK is used as a diagnostic tool for estimating the learning progress of learners of Chinese in relation to the length of time they have spent studying the language in class. The study was conducted in Germany, which has one of the largest Chinese learning communities in Europe. Over two years, 257 test takers participated in this study13, and 99 learners (without any Chinese language background) provided a good estimate of how many hours an “average” German learner needed to spend in class to achieve a specific (old) HSK level. The main questions guiding this research are:

- Does a positive correlation exist between the time learners spent in Chinese language classes and HSK scores?14

- If there is a relation between the time spent in classes and HSK results, what is the nature of this relation? Is it possible to estimate a regression line for predicting how long it takes to reach a certain level of proficiency in Chinese?

- What do these results tell us about the nature of the Chinese language acquisition process of German learners? What are the main factors influencing this process?

1.2 Why a validation of the old HSK is useful

This work (a) investigates language proficiency testing for CFL, (b) gives new insight into how Western test takers acquire Chinese, and (c) discusses these issues on the basis of theoretical approaches and methods from the field of testing (especially psychological testing). Thus, perspectives from different research fields and disciplines, which all overlap to a certain extent, need to be incorporated (cf. Figure 2). Chinese proficiency testing influences teaching Chinese as a foreign language (TCFL). Almost all large-scale CFL proficiency tests are based on word and grammar syllabi, which, in turn, have a huge influence on course books and other learning material. At the same time, CFL proficiency testing is strongly affected by the field of language testing, which is mostly dominated by Anglo-Saxon countries, particularly the United States and England. And finally, language testing is largely embedded in the theoretical grounds provided by psychological testing.

Figure 2: Localization of research fields relevant for this dissertation.

So, why does this dissertation investigate the old HSK, which was replaced by the new HSK in 2010? First, the old HSK was the most widespread proficiency test for CFL in the world. Since this dissertation deals with how German test takers perform on CFL proficiency tests,15 and since, as of 2007, the HSK was the only proficiency test available in Germany, empirical research could be conducted only on this test. Second, the HSK has one of the longest histories of all CFL proficiency tests. Researchers have generated a vast number of studies, which helped to develop and improve the HSK, and this offers a rich pool for understanding how CFL testing and research in China has developed and functions. Investigations on the (old) HSK continued until recently (e.g., Huáng Chūnxiá, 2011a, 2011b).16 Therefore, by using the concrete tool “HSK” and its research history, this work highlights the crucial mechanisms generally inherent in CFL testing. To reach this goal, the fundamental debate about today’s test theory, the concept of validity, and a useful and feasible approach for validation have been integrated into this work. Hopefully, this will offer new insights into CFL acquisition and a better understanding of the “CFL construct” and its assessment. As Liú Yīnglín (1994d) clearly stated, testing in CFL—as in other disciplines—is an ongoing process of making compromises and finding an appropriate and useful trade-off. To understand these compromises, a concrete test must be integrated into a clear and integral argumentative framework explaining what the test intends to measure.

1.3 Research overview and approach

With the rise of the HSK in the PRC (1990)17 and the TOCFL18 in the Republic of China (2004), proficiency testing for CFL came onto the agenda.19 More than 450 studies related to the HSK have been published, starting with Liú Xún, Huáng Zhèngchéng, Fāng Lì, Sūn Jīnlín, and Guō Shùjūn (1986).20 Many studies were published between 1989 and 2010 in the eight edited volumes on the HSK21; one edited volume deals with language test theory and CFL testing (Zhāng Kǎi, 2006a). The majority of these studies were conducted by professional HSK test developers22 to further improve the test. In the late 1990s, more critical studies followed, often published by test practitioners, such as test administrators or language teachers, the latter engaging in this debate because their teaching was affected by the HSK. These studies were often related to washback issues. Figure 3 shows the number of HSK studies published each year.

Figure 3: Chinese studies related to the HSK or using it as a research tool (in total 421).

The Chinese literature on CFL testing has not received much attention outside of China, although the number of standardized Chinese language proficiency test takers and test centers outside of China has constantly risen (Meyer, 2009). Mainland Chinese research can be divided into studies focusing on the old HSK, the Gǎijìnbǎn HSK (Revised HSK), and the new HSK. Research on the old HSK can be subdivided into research on the three different HSK test formats, which covered different levels of Chinese proficiency: (a) the Elementary-Intermediate HSK, (b) the Advanced HSK, and (c) the Basic HSK. This dissertation primarily targets the Elementary-Intermediate HSK, which was the first test launched officially in 1990. This test (and its successor, the new HSK) still has by far the highest total test-taking population of all CFL proficiency tests (cf. Figure 1), which is why the majority of all HSK studies examine this test. Because this dissertation focuses on the Elementary-Intermediate HSK, which is also the most important test for German test takers, it will mention studies on the Basic and the Advanced HSK only when necessary.23 HSK research was also conducted on different test-taker groups, especially on ethnic minorities and on test takers from specific countries, mostly Asian countries, because Asian test takers account for more than 95% of all foreign HSK test takers (Huáng Chūnxiá, 2011b, p. 61).24 Some studies investigated non-Asian test-taker groups, for example the situation in Italy (Xú Yùmǐn and Bulfoni, 2007; Sūn Yúnhè, 2011) or Australia (Wáng Zǔléi, 2009). Unfortunately, none of these studies explicitly differentiates between test takers who have a native Chinese language background and those who do not; exceptions are Yè Tíngtíng’s (2011) study on the situation in Malaysia and Shàn Méi (2006), who investigated the HSK’s face validity.
This dissertation will initially provide data distinguishing between both groups, and it will give new insights into learners who have absolutely no native Chinese language background.25 HSK research covers a vast variety of topics, even the historical aspects of testing in China.26 Other HSK research deals with the first revised version of the HSK, the Gǎijìnbǎn 改进版 HSK (Revised HSK, launched in 2007), and the new HSK (Xīnbǎn 新版 HSK, launched in 2010). The volume edited by Zhāng Wàngxī and Wáng Jímín (2010) deals solely with the Gǎijìnbǎn HSK. Most studies on the new HSK have appeared in recent years, starting with Lù Shìyì and Yú Jiāyuán (2003), who published the first essay about the new HSK.27 Up to now, around 40 studies in total concern the Gǎijìnbǎn HSK and the new HSK. In China and in Taiwan, one monograph each on CFL testing has been published.28 Wáng Jímín (2011) covers the whole spectrum of language assessment, with many examples coming from CFL testing, while Zhāng Lìpíng (2002) focuses entirely on testing for CFL.

Compared to the situation in China, Western research is rather scanty. Several studies originated in the United States, most of which deal with classroom assessment (e.g., Bai, 1998; Muller, 1972) or test formats and test types (Ching, 1972; Lowe, 1982; Yao, 1995). Chun and Worthy (1985) discuss the ACTFL29 Chinese language speaking proficiency levels. Hayden (1998) and Tseng (2006) examine language gain. In Germany, only five studies on the HSK have been published (Meyer, 2006, 2009; Reick, 2010; Ziermann, 1995b, 1996). Fung-Becker (1995) writes about achievement testing for CFL, and Lutz (1995) presents some thoughts on methods for assessing the oral ability of learners of Chinese.30

On the one hand, considerable knowledge about CFL testing exists in China; on the other hand, outside of China nearly no literature exists. Thus, this work presents the major findings of the rich HSK research to a Western audience. It will also identify crucial questions in CFL proficiency testing and explain why a “perfect” language proficiency test for CFL will never exist: testing goals, test takers, the context in which the Chinese language is used and assessed, as well as the resources and testing technologies used, will always vary and have to be specified and adjusted to the specific needs and uses of a test. However, the crucial points, or main theoretical issues, will remain. I hope this study can contribute to the above-mentioned fields by clearly revealing what these main issues are and how they affect CFL testing.

Over time, the quality of HSK studies has gradually improved. Studies in the 1980s were concerned with the foundation of the HSK, especially the target language domain, the scoring, and the reliability of the HSK. One of the main goals of researchers at that time was to provide norm-referenced scores and to make the HSK a stable measure. Validation studies began in 1986 and emerged in greater numbers in the 1990s. In the 2000s, washback studies emerged. Jìng Chéng (2004) claims that researchers who were not involved in the HSK test development process had no access to test taker data and could not obtain samples large enough to be statistically meaningful; the author argues that non-test developers therefore had to engage in more qualitative than quantitative research (p. 23). However, HSK research maintained high quality and shifted from larger fields to increasingly specialized topics. Though confirmatory studies initially dominated HSK research, several studies were very critical and disclosed controversial points. Non-test developers later expanded on these critiques. One specific criticism stemmed from teachers and universities in the autonomous region Xīnjiāng, whose participants had outnumbered the foreign test takers after 1999 (cf. Figure 1, p. 11), and for whom the HSK became a high-stakes test because admission officers required HSK certificates as part of the decision-making process to admit ethnic minority students to Chinese universities and colleges.

Thus, some investigations on the HSK that are thematically related to this work provided rich information for this dissertation and are quoted in several chapters31, while others could be summarized in one or two sentences, and still others are not mentioned because they did not provide new insights. The majority of HSK studies used quantitative approaches; qualitative studies investigating single learners occur only occasionally, though the idea of combining different methods in a way appropriate for the specific research field—triangulation (e.g., Grotjahn, [2003] 2007, p. 497; Kelle, [2007] 2008)—is known among Chinese language testing experts (e.g., Chén Hóng, [1997c] 2006, p. 235).

The HSK research was used to validate the test (chapter 4), and the validation focuses on the Elementary-Intermediate HSK. In chapter 2, the term language proficiency will be discussed in detail to foster a better understanding of Chinese HSK research; in addition, terminology relevant for this dissertation will be defined. Chapter 3 provides the theoretical foundation of testing, presenting the quality criteria in language testing, and it explains the crucial concept of validity, how it has been understood in psychological testing, and how it is used in this dissertation. Based on this validity concept, the theoretical approach underlying the validation in this work will be depicted in detail. Chapter 5 extends the HSK validation with an empirical investigation of HSK test takers in Germany. The validity argument for the HSK will be presented in chapter 6, followed by the conclusion in chapter 7.

1.4 History of the HSK

Sūn Déjīn (2009) divides the development of the HSK into three periods: (a) an initial phase (chūchuàngqī 初创期) from 1980 to 1990, (b) an expanding stage (tuòzhǎnqī 拓展期) from 1990 to 2000, and (c) an innovative stage (chuàngxīnqī 创新期) from 2000 onward. A fourth stage started with the new HSK in 2010, ending the innovative stage.

In 1981, the development of the HSK started with research on small-scale tests. At that time, the HSK was strongly influenced by standardized language tests from the United States and England, especially the TOEFL, which had just reached the Chinese mainland and shifted the focus in Chinese foreign language didactics from language knowledge to language ability (Sūn Déjīn, 2009, p. 19; cf. Liú Yīnglín, [1988b] 1989, pp. 110–111). After the founding of the “HSK design group” (Hànyǔ Shuǐpíng Kǎoshì Shèjì Xiǎozǔ 汉语水平考试设计小组) in December 1984, led by Liú Xún 刘珣 and consisting of ten members32, the first test was developed and pretested in June 1985 at the BLCU33 (Liú Xún et al., [1986] 1997, p. 77; Liú Yīnglín, [1990b] 1994, p. 45; Liú Yīnglín et al., [1988] 2006, p. 23; Sūn Déjīn, 2007, p. 130; Zhāng Kǎi, 2006c, p. 1). Liú Xún reported the results of the 1985 pretest at the first conference on “International Chinese Didactics,” where they caused a “stir.” Afterwards, further large-scale pretests were conducted in 1986 and 1987; in 1988, the BLCU launched the first official HSK and issued certificates to the test takers, who have had to pay a test fee since 1989 (Sūn Déjīn, 2007, p. 130, 2009, p. 19). At that time, the HSK consisted only of the test format that was later renamed Elementary-Intermediate HSK.

From June 1985 to January 1990, 8,392 test takers from 85 countries participated in the HSK, and the examinations were held at 33 test sites in 16 Chinese provinces, cities, and autonomous regions (Liú Yīnglín and Guō Shùjūn, [1991] 1994, p. 12). From 1985 on, five large-scale pretests were administered, one per year. In March 1989, the BLCU established the Chinese Proficiency Test Center (HSK Center; Hànyǔ Shuǐpíng Kǎoshì Zhōngxīn 汉语水平考试中心; Zhāng Kǎi, 2006c, p. 2); the Center provided the professional basis for HSK development and research. In 1990, the HSK was appraised by experts and officially launched.

In 1991, the HSK was launched outside of China, and the number of test takers steadily increased.34 Because the HSK only assessed the elementary and intermediate proficiency levels, the “Advanced HSK” (Gāoděng 高等 HSK) was introduced in 1993, and the original HSK was renamed the Elementary-Intermediate HSK (Chū-zhōngděng 初、中等 HSK). In 1997, the Basic HSK (Jīchǔ 基础 HSK) entered the scene.

In 2000, the number of test takers reached 85,238, of whom 31,067 were “foreigners” and 54,171 belonged to Chinese ethnic minorities. In this phase, research investigated to what extent the HSK fulfilled the needs of different stakeholders, which, in addition to Chinese learners, included universities, companies, and other organizations that used test takers’ HSK scores for making decisions about university admission, employment, etc., and the HSK “product” was revised, also with respect to economic aspects (Sūn Déjīn, 2009, p. 19). In 2006, the HSK Threshold (Rù-mén jí 入门级 HSK) and the C.TEST (Shíyòng Hànyǔ Shuǐpíng Rèndìng Kǎoshì 实用汉语水平认定考试) were launched. The former test had been designed to measure the Chinese language ability of learners who had attended fewer than 200 hours of Chinese classes; it was developed to meet the market demand created by rising numbers of Chinese learners outside of China who studied Chinese as a hobby. The C.TEST was created to assess the Chinese language ability needed for working in China and for daily life, and it was intended to help Chinese companies recruit non-native Chinese employees (Sūn Déjīn, 2006, p. 4). In 2007, an oral examination was additionally offered, called the “C.TEST oral examination” (C.TEST Kǒuyǔ Kǎoshì 口语考试; Wáng Jímín, 2011, p. 36).

These years marked two further important events. First, in 2006 the total number of HSK test takers exceeded 1,000,000. Second, around 2005, the Chinese Ministry of Education withdrew HSK authorization from the HSK Center and shifted all rights to the Hanban35 (Lǐ Háng, 2010, p. 952), and the Hanban founded its own test section; thus, the HSK Center has not been able to access test taker data since 2005–2006. Moreover, the first revised version of the HSK—the Gǎijìnbǎn HSK (改进版 HSK, Revised HSK), which had been developed and launched by the HSK Center on April 21, 2007—was not supported and promoted by the Hanban. The Gǎijìnbǎn HSK was actually intended to replace the old HSK (Zhāng Wàngxī and Wáng Jímín, 2010). However, in 2010 the Hanban introduced the new HSK (Xīn Hànyǔ Shuǐpíng Kǎoshì 新汉语水平考试), which drastically lowered the standards in CFL; moreover, it amateurishly linked the test to the Common European Framework of Reference for Languages (CEFR; cf. Xiè Xiǎoqìng, 2011, p. 11). Not only because of the decrease in standards, but also because of the introduction of subtests assessing oral and written productive Chinese abilities and a massive promotion campaign executed by the Confucius Institutes outside of China, the number of test takers skyrocketed in 2010 (cf. Sūn Yúnhè, 2011). In addition, the Hanban introduced the Business Chinese Test (BCT; Shāngwù Hànyǔ Kǎoshì 商务汉语考试) and the Youth Chinese Test (YCT; Xīn Zhōng-Xiǎoxuésheng Hànyǔ Kǎoshì 新中小学生汉语考试). Against this background, the following statement by Sūn Déjīn, the former head of the HSK Center, can be seen in a completely new light:

… We [the researchers of the HSK Center] believe that the development and the existence of the HSK have to insist on scientific principles and directions. …If there is no scientific basis, there will be no future for the HSK. (Sūn Déjīn, 2009, p. 20)


1.5 Other Chinese language proficiency tests

Gaining an overview of proficiency tests for CFL has become more confusing year after year. As Zhū Hóngyī (2009) notes, in Mainland China alone almost ten tests already exist that aim to assess the Chinese language ability of non-natives of Chinese (p. 54). An attempt to list all existing CFL tests worldwide would probably fail, and the scientific value of such a listing would also be doubtful, because comparing tests is difficult and usually not very fruitful: every test has its own specific purpose and circumstances (e.g., different target populations). Nevertheless, this section gives a short overview of the most important proficiency tests for CFL. The following aspects were considered when choosing specific CFL tests: (a) test-taking population size, and/or (b) Westerner participation, and (c) whether the test can be considered a high-stakes test. The HSK and the TOCFL have already been mentioned in the preceding sections. For the above-mentioned reasons, the list of tests below does not claim to be exhaustive.

The first test that needs to be mentioned is the Chūgokugo kentei shiken 中国語検定試験 (Chinese Proficiency Test), launched in 1981 by the Japanese Society for Testing Chinese (Nihon Chūgokugo kentei kyōkai 日本中国語検定協会). This test seems to be the first professional CFL proficiency test, and it is designed for Japanese native speakers. As of 2011, a total of 75 exam administrations had taken place, in which 600,000 candidates participated; of these, 180,000 received a certificate. The test is offered three times per year. The listening subtest also includes a dictation, and the test has a translation subtest (Chinese–Japanese–Chinese). All 18 test forms administered each year—three sessions with six formats each—are published within half a year, together with audio recordings, answer keys, and explanation sheets. Approximately 20,000 test takers per year currently take the test. In 2004, more Japanese took this test than participated in the HSK (Oikawa, 2009; Sū Jìng, 2009; Sūn Déjīn, 2009; Wikipedia, 2011; Yáng Chūxiǎo, 2011).

Another test from Japan, the Chūgokugo komyunikēshon nōryoku kentei 中国語コミュニケーション能力検定 (Test of Communicative Chinese, TECC), was initiated by the Chūgokugo kōryū kyōkai 中国語交流協会 (Society for the Exchange of Chinese) and launched in 1998. The test is designed to assess communicative Chinese ability. Chinese language experts and major Japanese companies with trade experience with Chinese counterparts initiated the exam. Japanese companies willingly accept these certificates, and the number of test takers has risen significantly in recent years (Sū Jìng, 2009, p. 91). Though the name of the test claims to measure communicative ability, it consists of only a listening and a reading subtest, which last 35 and 45 minutes, respectively (Zhāng Lìpíng, 2002, p. 9).

In the United States, three major tests evaluate whether students have mastered the Chinese ability usually taught during a four-semester college course. The certificates are regularly used when applying for university admission. The CPT (Chinese Proficiency Test) was developed in 1983 by the Center for Applied Linguistics (CAL). The target population consists of English-speaking learners of Chinese, generally students who have studied two or more years of Chinese at a college or university in the U.S. The CPT has a listening subtest and a reading subtest (the latter also contains a structure section). All response options on the listening subtest are in English, as are all questions on the other subtests, and all 150 items are multiple-choice items with four answer choices. The CPT offers a Cantonese version as well (Center for Applied Linguistics, 2010). In addition, the CAL offers a Preliminary Chinese Proficiency Test (Pre-CPT) for students who have studied Chinese in school for three to four years or for college students who have studied for at least one year.

The SAT (Scholastic Aptitude Test) Subject Test in Chinese with Listening measures the reading and listening abilities of students who have studied Chinese for two to four years in high school and helps place them into higher-level college or university Chinese language classes. The test is developed by the Educational Testing Service (ETS). It has three subtests: listening (30 items), grammar (25 items), and reading (30 items). Similar to the CPT, the tasks are mostly in English. All items of the grammar subtest are displayed in simplified characters, traditional characters, Pīnyīn, and the Taiwanese transcription Zhùyīn Fúhào (注音符號; also called Bopomofo).

The Advanced Placement Program® (AP®) offers a Chinese Language and Culture examination, which roughly equals a four-semester college course. It is a computer-based test, also administered by ETS. Questions are provided in simplified and traditional characters, and test takers can choose which system they use for writing (answers are typed on a keyboard). The test has four subtests: listening (30 items, 20 minutes), reading (35–40 items, 1 hour), writing (2 tasks, 30 minutes), and speaking (7 tasks, ca. 11 minutes); the whole test usually lasts around 2 hours and 15 minutes. Questions and answer choices are all given in English, and the writing and speaking tasks are rated holistically (The College Board, 2011).

1.6 Transcription system in this work

This dissertation uses the Hànyǔ Pīnyīn transcription for Chinese words and names. Exceptions are established names such as Peking University, Tsinghua University, or the above-mentioned Hanban. Normally, the order applied here is Hànyǔ Pīnyīn, Chinese characters, and then the English translation. When Chinese characters are the focus, they may be placed first, and where titles of studies, books or syllabi have been used, the English translation precedes. All Chinese authors who have published in Chinese are transcribed family name first, followed by the given name (without comma). The Pīnyīn spelling follows the rules of the Xīnhuá Pīnxiě Cídiǎn 新华拼写词典 [Chinese Transliteration Dictionary], published in 2002. Thus, tone diacritics are used throughout this work, and proper nouns are capitalized. Korean names are transcribed using the McCune-Reischauer Romanization, Japanese words using the Hepburn Romanization. Any translation or spelling mistakes are due to shortcomings of the author. This also applies to block quotations from Chinese and their related translations.

5 The Hanban (Hànbàn 汉办 or Guójiā Hànbàn 国家汉办) stands for 中国国家汉语国际推广领导小组办公室 (Zhōngguó Guójiā Hànyǔ Guójì Tuīguǎng Lǐngdǎo Xiǎozǔ Bàngōngshì; “The Office of Chinese Language Council International”). It is a non-governmental and non-profit organization affiliated with the PRC’s Ministry of Education.

6 According to one high HSK official, the new HSK has absolutely nothing in common with the old one “despite its name” (private conversation in 2010). Official documents and research literature have an inconsistent spelling of the “new HSK” or “New HSK.” In this dissertation, the spelling “new HSK” has been adopted.

7 Such a test in Chinese is called fēnlìshì cèshì分立式测验.

8 A test or a specific item of a test can be considered biased if it favors a group of test takers because the measured performance is influenced by a trait or feature of this group that is not part of the construct the test intends to assess (cf. section 4.5.4).

9 In this work, the terms test taker, (test) candidate, testee, participant and examinee are used synonymously.

10 Ziermann (1996) compared the answering time of the HSK listening subtest with that of other language proficiency tests, such as the TOEFL or the Certificate of German as a Foreign Language (Zertifikat Deutsch als Fremdsprache). This comparison rests on the assumption that a universal, appropriate answering time for listening subtests exists across language proficiency tests in general (across languages and across tests), which is a fundamental misunderstanding of testing.

11 Scores themselves can never be valid or invalid, just the interpretations of scores and their use can be valid or not. This will be explained in more detail in section 3.3.

12 The HSK consisted of 170 items; 154 of them were multiple-choice items with four answer choices (one key and three distractors). In the cloze test (the last 16 items), test takers had to fill in blanks with characters to complete short texts.

13 The surveys were conducted directly after the test, and participation was optional.

14 Other scenarios might also be possible. For instance, there could be a correlation up to a certain number of hours of Chinese classes a learner has taken, e.g., up to 1,000 hours; beyond this threshold, other factors could become more important for gaining language competence in Chinese (e.g., communicating with Chinese friends, watching Chinese movies, etc.), so that no correlation may be found above this amount of instruction. If the relation between the two variables is non-linear, the correlation coefficient normally diminishes.
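The effect described in this footnote can be illustrated numerically. The following Python sketch is purely illustrative: the data and the 1,000-hour threshold are invented for demonstration and are not taken from any HSK survey. It computes Pearson’s correlation coefficient for a relation that is linear up to a threshold and flat beyond it, showing that including the plateau lowers the coefficient.

```python
# Illustrative sketch: Pearson's r shrinks once a linear relation plateaus.
# All numbers are hypothetical, chosen only to demonstrate the effect.

def pearson_r(xs, ys):
    """Compute Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical learners with 100 to 2,000 hours of Chinese classes.
hours = list(range(100, 2001, 100))
# Competence grows linearly up to a 1,000-hour threshold, then plateaus.
competence = [min(h, 1000) for h in hours]

r_linear = pearson_r(hours[:10], competence[:10])  # strictly linear segment
r_full = pearson_r(hours, competence)              # includes the plateau

print(round(r_linear, 3))  # 1.0 on the linear segment
print(round(r_full, 3))    # noticeably smaller once the plateau is included
```

On the strictly linear segment the coefficient is exactly 1.0; over the full range, including the flat region, it drops well below 1, which is the diminishing effect the footnote describes.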

15 In Europe, the HSK was first administered in Hamburg, on June 4, 1994 (Ziermann, 1996).

16 With DIF studies she investigated performance differences of Western and Asian test takers.

17 The HSK was reviewed by experts in 1990. In 1992, it became the official language proficiency test of the PRC (Liú Yīnglín, 1994, preface, p. 1).

18 When launched in 2003, the test was named CPT (Chinese Proficiency Test). In 2007, the test was renamed TOP (“Test of Proficiency – Huayu”). On August 4, 2010, the Ministry of Education of the Republic of China announced that the “TOP – Huayu” would be called “Test of Chinese as a Foreign Language” (TOCFL) from that day on. The Chinese name––Huáyǔwén Nénglì Cèyàn 華語文能力測驗––has never changed.

19 In 1981, the “Chinese Language Test” (Zhōngguóyǔ Jiǎndìng Shìyàn 中国語検定試験) was launched in Japan by the preceding organization of the “Japanese Society for Testing Chinese” (Rìběn Zhōngguóyǔ Jiǎndìng Xiéhuì 日本中国語検定協会). Approximately 15,000 test takers per year participate in this test (Yáng Chūxiǎo, 2007, p. 45).

20 In fact, more essays have been published; however, in some of these studies the HSK plays only a very subordinate role, so they have not been counted.

21 The last volume focused on the Gǎijìnbǎn HSK [改进版 HSK; Revised HSK].

22 The term “test developer” refers to individuals who design and develop tests or assessments; the term “test user” refers to individuals who make decisions based on assessments.

23 The Advanced HSK was taken by few German test takers because German and other Western test takers almost never reached this proficiency level (Kaden, 2004, p. 4; Meyer, 2006).

24 These studies included, e.g., Korean (Cuī Shūyàn, 2009), Japanese (Sū Jìng, 2009; Yáng Chūxiǎo, 2007, 2011), Vietnamese (Lǚ Xiá and Lín Kě, 2007), Mongolian (Zhāng Ruìfāng, 2008, 2011; Sū Dé and Táo Gétú, 1999), Malaysian (Yè Tíngtíng, 2011) and Thai test takers (Lóng Wěihuá, 2011).

25 The distinction is important because a certain proportion of foreign HSK test takers have a native Chinese language background, e.g., in Germany approximately 35% (cf. chapter 5).

26 Rén Xiǎoméng (1998) compared the HSK and the Chinese imperial civil-service examination system (Kējǔ 科举).

27 This essay was a political text by Hanban officials who wanted to “explain” why the previous research on the old HSK conducted by HSK research specialists was meritless and not very fruitful scientifically. Other studies, e.g., Yáng Chéngqīng and Zhāng Jìnjūn (2007), explained why the old HSK should lower its difficulty to ensure better access for Chinese learners outside of China and to “promote” the development of the Chinese language.

28 Several unpublished master’s theses exist. In the library of the Graduate Institute for Teaching Chinese (Huáyǔwén Jiàoxué Yánjiūsuǒ 華語文教學研究所) at the National Taiwan Normal University (NTNU; Guólì Táiwān Shīfàn Dàxué 國立臺灣師範大學), one master’s thesis on grammar assessment for CFL could be found (Yáng Yùshēng, 2007).

29 The ACTFL (American Council on the Teaching of Foreign Languages) aims to improve and expand the teaching and learning of foreign languages in the United States.

30 Ziermann (1995a) wrote an unpublished master’s thesis (Magisterarbeit) on one HSK administration in Germany.

31 Sūn Déjīn (2009) says that nowadays HSK experts are able to discuss and exchange ideas with other leading experts on language testing at the same level, for example from Educational Testing Service (ETS) in the United States.

32 According to Zhāng Kǎi (2006c), the group had been formed in October 1984. Other founding members were Huáng Zhèngchéng 黄政澄, Fāng Lì 方立, Sūn Jīnlín 孙金林 and Guō Shùjūn 郭树军. In 1986, the core group consisted of Liú Yīnglín 刘英林, Guō Shùjūn 郭树军 and Wáng Zhìfāng 王志芳 (p. 1). Sūn Déjīn (2009) indicates only six people (p. 19).

33 BLCU stands for Beijing Language and Culture University, in Chinese Běijīng Yǔyán Dàxué 北京语言大学 (formerly called Běijīng Yǔyán Xuéyuàn 北京语言学院).

34 Statistics show that every year the HSK had more test takers inside China than outside, at least until 2005 (Sūn Déjīn, 2009, p. 20).

35 Cf. footnote 5, p. 6.

| 25 →

2 Language proficiency

Tautological as it may sound, language proficiency36 tests try to measure the proficiency of test takers in a certain language (Vollmer, 1981).37 However, what do we mean when we say someone has a specific “competence,” “level,” or “proficiency” in a foreign language? Not surprisingly, the way we define language proficiency or language ability has major implications for how we design language tests and for what construct we are assessing (cf. Chén Hóng, [1999] 2006, p. 248; [1997b] 2006, p. 208). Therefore, the following questions will structure this chapter:

- How do researchers in applied linguistics and language testing experts understand the notion of language proficiency?

- Does some sort of common definition exist for this term, with which a majority of experts in the field agrees?

- What are the central issues inherent in language proficiency that language-testing experts currently identify?

- How do Chinese CFL and second language acquisition experts use and interpret language proficiency and how does that influence the HSK design and CFL proficiency testing?

- Finally, what role does the construct of proficiency in CFL play in the validation section of this dissertation?

2.1 Definition of central terms

In language testing, a variety of specific terms are used, some of them covering a broad array of meanings, partly because they have been used for more than half a century, and partly because numerous authors as well as practitioners use them in various contexts.38 Therefore, in the following paragraphs some essential terms for this dissertation will be defined as they are understood and used in this work.

The terms “test,” “assessment,” “measurement,” “evaluation,” and “examination” are often used synonymously (Bachman, 1990, pp. 18, 50f.), and they “are ← 25 | 26 → commonly used to refer to more or less the same activity: collecting information” (Bachman and Palmer, 2010, p. 19),39 or “collecting data” (Cronbach, 1971, p. 443). The methods utilized for collecting this information (e.g., self-reports, questionnaires, interviews, etc.) and the way we record it (e.g., via audio or video recording, verbal descriptions, ratings, etc.) do not affect the above-mentioned terms (Bachman and Palmer, 2010, p. 20). What matters are the conditions under which information is collected and the procedures that are applied (e.g., Grotjahn, 2003, p. 9):

What is important, we believe, is that the test developer clearly and explicitly specifies the conditions under which the test taker’s performance will be obtained and the procedures that will be followed for recording this performance. Thus, we view “assessment,” “measurement,” and “test” as simply variations of a single process… (Bachman and Palmer, 2010, p. 20; italics in original)

In this work I will generally follow Bachman’s (1990) and Bachman and Palmer’s (2010) suggestion and use the terms “assessment,” “measurement,” and “evaluation” synonymously for referring to the activity or process of testing or assessing.40 The terms “test” and “examination” solely refer to the instrument used during the testing process (cf. AERA, APA, and NCME, 1999, p. 3). One of the most often cited definitions for test stems from Cronbach ([1949] 1970):

A test is a systematic procedure for observing a person’s behavior and describing it with the aid of a numerical scale or category system.41 (p. 26; original completely in italics)

Crocker and Algina (1986) define the term measurement the following way: ← 26 | 27 →

Measurement of the psychological attribute occurs when a quantitative value is assigned to the behavioral sample collected by using a test. (ibid., p. 5)

Therefore, an assessment has to collect information by using a test according to procedures that are systematic and substantively grounded (Bachman, 2004, pp. 6–7; Bachman and Palmer, 2010, p. 20), and it quantifies or at least categorizes the behavior of candidates. Grotjahn has described the systematic-procedure aspect as “controlled conditions” (kontrollierte Bedingungen; 2003, p. 9). This aspect, also called “systematicity,” refers to the point that tests

are designed and carried out according to clearly defined procedures that are methodical and open to scrutiny by other test developers, researchers, and stakeholders. (Bachman and Palmer, 2010, p. 20)

Assessments have to be replicable by other individuals at another time. Regarding the aspect “substantively grounded,” Bachman and Palmer (2010) write:

[A]ssessments are substantively grounded, which means that they are based on a recognized and verifiable area of content, such as a course syllabus, a widely accepted theory about the nature of language ability, prior research, including needs analysis, or the currently accepted practice in the field. (ibid., p. 20; italics added)

This second part of the specification of a language test, the “verifiable area of content,” has a dramatic impact on test design in general, and it is highly disputed among CFL proficiency testing experts; this disagreement can also be seen in the new HSK and its word syllabi. Grotjahn (2000) narrows the above-mentioned definition of the term test by noting that another typical feature of tests is that they usually replace more exhaustive and extensive forms of collecting information, such as portfolios, with more time-efficient and simpler procedures (p. 305). This characteristic is also typical of language proficiency tests.

A trait is a mental characteristic. Bachman (1990) says:

In testing we are almost always interested in quantifying mental attributes and abilities, sometimes called traits or constructs, which can only be observed indirectly. (p. 19)

In this regard, Bachman (2004) uses the term “unobservable ability” (p. 8), often also referred to as latent trait. This concept will be illustrated in more detail in section 2.2. According to the Standards, the term “construct” is not limited to characteristics that are not directly observable; there, it is used more broadly “as the concept or characteristic that a test is designed to measure” (AERA, APA, and NCME, 1999, p. 5).

An item is a single element of a test designed in a way to elicit certain behavior from the test candidate, which is evaluated independently from other test elements (Grotjahn, 2000, p. 305).

In this work, language proficiency will be used synonymously with Bachman and Palmer’s communicative language ability (1996). Ability42 is defined as the ← 27 | 28 → capability to implement language knowledge or language competence in language use (Bachman, 1990, p. 108).43 Bachman and Palmer (2010) say:

[W]e describe language ability as a capacity that enables language users to create and interpret discourse. We define language ability as consisting of two components: language knowledge and strategic competence. Other attributes of language users or test takers that we also need to consider are personal attributes, topical knowledge, affective schemata, and cognitive strategies.44 (p. 33)

This modern view of language ability, which takes the strategic components into account, has been recognized by many language testing experts and applied linguists (e.g., Bialystok, 1990; Chapelle et al., 2008; Widdowson, 1983). The term language proficiency––in Chinese often yǔyán shuǐpíng语言水平 (language level)––is adopted when a test measures the language ability of language users independent of “how, where, or under what conditions” (Bachman, 1990, p. 16) the test taker acquired his level of proficiency (amongst others Carroll, 1961; Oller, 1979; Spolsky, 1968). Therefore, language proficiency tests generally have no connection to language courses and language learning material (Grotjahn, 2003, p. 40). The most problematic point is that proficiency tests attempt to assess the language ability of language users over a wide variety of contexts45, for instance the Test of English as a Foreign Language (TOEFL) or the Test of German as a Foreign Language (Test Deutsch als Fremdsprache, TestDaF)46, which both try to measure the academic language proficiency of test takers. The HSK tries––first and foremost––to assess academic language proficiency as well. However, in language testing language use is always connected to context. This means that the language targeted by language proficiency tests, the so-called target language domain, although used in ← 28 | 29 → a wide variety of contexts, can never be transferred to all contexts.47 Therefore, the inferences drawn from the results of such tests must be limited, and they must be related to a specific target language domain. 
For example, a non-native speaker of Chinese might be highly proficient in reading academic Chinese or common journalistic texts, and in using Chinese orally in an academic environment, but might have problems communicating with Chinese workers at a construction site because he or she is not familiar with the words, structures, and variety of the language used in such a context.48 Because the term “proficiency” suggests that there is one particular kind of language proficiency across all contexts, which in turn can be measured by a specific language proficiency test, some experts in the field of language assessment prefer the terms “communicative language ability” (Bachman, 1990, 2005, 2007; Bachman and Palmer, 1996, 2010) or “communicative competence” (North, 1994). Chapelle et al. (2008) conclude:

A conceptualization of language proficiency that recognizes one trait (or even a complex of abilities) as responsible for performance across contexts fails to account for the variation in performance observed across these different contexts of language use. As a consequence, virtually any current conceptualization of language proficiency in language assessment attempts to incorporate the context of language use in some form (Bachman & Palmer, 1996; Chalhoub-Deville, 1997; Chapelle, 1998; McNamara, 1996; Norris, Brown, Hudson, & Bonk, 2002; Skehan, 1998). (Chapelle et al., 2008, p. 2; italics added)49

The term competence (or competencies) has been widely used in educational contexts, but no clear definition exists. Weinert (2001a) identifies six different concepts. For White (1959), competence is “an organism’s capacity to interact effectively ← 29 | 30 → with its environment” (p. 297). It is not clear whether competencies are the result of a successful activity (White, 1959) or the origin of, or condition for, fulfilling an activity (McClelland, 1973). For Chomsky (1965), competence is the theoretical potential linguistic ability, while the actual use of language is performance. Modern concepts of competence also include motivational, volitional and social elements (cf. Weinert, 2001a). In addition, authors distinguish Fachkompetenz (“professional competence” or “expertise”), überfachliche Kompetenz (“generic competence”), and Handlungskompetenz (“ability or capacity to act”) (Weinert, 2001b).50 Rychen and Salganik (2003) underscore the influence of the context. Grob and Maag Merki (2001) add the probabilistic facet of competencies: a person might be competent to do something at one specific point in time but not at another (cf. Chomsky, 1965), due to the many factors influencing performance. In addition, competencies are connected to emotions (Grob and Maag Merki, 2001, pp. 59ff.). For Klieme et al. (2003), a person is competent when he or she can solve specific problems (p. 72). Competencies are generally learnable, and they can normally be influenced through training or experience (Maag Merki, 2009, p. 495). Overall, the term competence overlaps considerably with the terms proficiency and ability, especially with regard to context, the source or outcome of successfully managed activities, and learning. Furthermore, similarities exist in regard to hierarchical aspects, e.g., the relation between different competencies or the existence of more global competencies alongside less general or more specific ones.

2.2 Ability/trait vs. context/situation/task

In language testing, we want to measure a trait, an ability, or a construct of underlying traits (often language tests aim at course syllabi), which are related to each other, interact in complex ways, and interact within the context of language use. But how can we grasp the construct, namely language ability or competence51? This is a core question in language testing because what we aim to measure is inseparably linked to validity and validation.52 As Zhāng Kǎi (2006c) states:

[I]n language testing, issues of validity and (language) competence both are two sides of one problem. (p. 5).53

Unfortunately, language competence is a latent ability or latent trait54, which is not directly observable. Crocker and Algina (1986) say that “psychological attributes cannot be measured directly; they are constructs” (p. 4), which means that we can ← 30 | 31 → merely measure this construct through the observation of the performance of a person (Bachman, 1990, p. 19; Chén Hóng, [1997c] 2006, p. 225; Grotjahn, 2000, p. 306, 2003, p. 8; Zhāng Kǎi, 2006a). So, what is a trait generally speaking? Messick’s (1989b) definition seems to me to be the most comprehensive:

A trait is a relatively stable characteristic of a person—an attribute, enduring process, or disposition—which is consistently manifested to some degree when relevant, despite considerable variation in the range of settings and circumstances. … A trait is an enduring personal characteristic hypothesized to underlie a type of stable behavioral consistency. (ibid., p. 15; emphasis added)

This statement includes everything on which language testing researchers based their interpretations or views of language competence; in addition, it involves the central issue why language testers and researchers in language acquisition and applied linguistics have tried to define the notion of language proficiency for more than half a century without developing a unanimous concept. Bachman (2007) identifies the main problem in specifying the construct for language assessment:

Understanding the roles of abilities and contexts, and the interactions between these as they affect the performance on language assessment tasks, has remained a persistent problem in language assessment. (ibid., p. 41; emphasis added)

Conceptions of the construct of language competence can develop from only two opposing sides, namely ability (trait) or context (task)55 of language use. Psychometricians are interested in behavioral consistencies, which are often denominated performance consistencies in language testing (e.g., Chapelle, 1998). But what is more important for these performance consistencies, ability or context? Not surprisingly, some scientists consider performance mainly as a manifestation of the trait (ability), whereas others see these contexts or situations—in Messick’s terminology “environmental contingencies” (1989b)—as the major factor. A third and last group holds a viewpoint between the two described above, “attributing some behavioral consistencies to traits56, some to situational factors, and some to interactions between them” (Messick, 1989b, p. 15). This applies probably best to the latest concepts of language ability. Bachman (2007) refers to the three different positions in the field of language testing with regard to the construct as (a) trait/ability-focused, (b) task/context-focused, and (c) interaction-focused (pp. 41–42).57 The trait/ability-focused approach, also called “skills and component framework” (Bachman, 1990, p. 4) or “skills and elements model” (Bachman, 2007, p. 46), distinguishes between components of knowledge (vocabulary, grammar, phonology, graphology, etc.) and skills (listening, speaking, reading, and writing). It attempts to identify critical features ← 31 | 32 → of the language (Bachman, 1990, p. 34). However, this approach, introduced by Lado (1961) and Carroll (1961), does not describe the relation between knowledge and components. Another limitation concerns the fact that this model does not recognize the full context of language use (Bachman, 1990, p. 82). 
This approach had a huge impact on language testing, for instance on large-scale assessments such as the Test of English as a Foreign Language (TOEFL; Educational Testing Service), the Michigan Test of English Language Proficiency (English Language Institute, University of Michigan), and other language tests (Bachman, 2007, p. 47). The idea of this approach is to break language down into its basic elements, which “anatomizes” the language. In a second step, these elements are weighted according to their difficulty, relevance and frequency of occurrence, i.e., into more difficult and easier, and more relevant and less relevant, elements. This approach was central to the design and development of the HSK, and it seems to have had a certain impact on proficiency testing in CFL in general. The task/context-focused approach has also been called the direct testing58 approach or performance assessment59 (e.g., Clark, 1972, 1975; Jones, 1979, 1985a, 1985b; Wesche, 1987). This approach tries to sample real-life language use or tasks, meaning that tasks in language tests should resemble real-life language use as much as possible and be “authentic.”60 For example, advocates of direct testing promoted, among other things, face-to-face interviews; direct testing was a countermovement against discrete-point tests61, which had been advocated by Lado. The nature of the test tasks within a specific context was the focus of interest (Bachman, 2007, p. 48). In this vein, such tests try to predict the future performance of test candidates in similar situations. The interactional approach stresses the interaction between traits and contexts, with a component controlling the interaction between trait and context ← 32 | 33 → (Chapelle, 1998, pp. 44 and 58). According to Chapelle (1998), this component is comparable to Bachman’s (1990) strategic competence and to Bachman and Palmer’s (1996) metacognitive strategies (Chapelle, 1998, p. 44).
For He and Young (1998), interactional competence consists of abilities that are “jointly constructed by all participants” (p. 5; italics in original), although they do not always seem clear about whether the individual participants bring interactional competence to an interactional practice (Bachman, 2007, p. 60). Chalhoub-Deville (2003) describes this approach with the term “ability-in-individual-in-context,” which stands for “the claim that the ability components that a language user brings to the situation or context interact with situational facets to change those facets as well as to be changed by them” (p. 372).

2.3 Language proficiency in CFL

A central question is how the terms language proficiency and communicative language ability are understood by Chinese authors engaged in proficiency testing62 for CFL. Given the worldwide proficiency debate, which has most notably been shaped by researchers from English-speaking countries, it is not surprising that one of the leading experts on testing in CFL, Zhāng Kǎi (2006c), acknowledges the major significance and essentiality of the above-mentioned issues right at the beginning of the preliminary summary of his edited volume, entitled Yǔyán Cèshì Lǐlùn jí Hànyǔ Cèshì Yánjiū语言测试理论及汉语测试研究 [Language testing theory and Chinese language testing research]. There, he does not conceal that language testing is an extremely complex undertaking, even though it may sound trivial at first:

What language tests want to measure is, commonly speaking, the so-called language ability or some integral part of that ability … This sounds easy, but if one wants to investigate it more thoroughly, there will be more problems.

When looking at language tests, the issue of validity means whether a test assesses—or to what extent it assesses—this so-called language ability. But if one wants to know whether it measures this specific ability, we first have to know what this so-called language ability is. (Zhāng Kǎi, 2006c, p. 5)

He notes that the concepts of language competence held by various researchers are “diverse and confused.”63 Chén Hóng ([1997b] 2006) shares this perspective, saying that “diverse language theories have different definitions of language competence” ← 33 | 34 → (p. 210); the main question is how to define language competence (p. 210). Zhāng Kǎi (2006c) also indicates that the term yǔyán nénglì语言能力 (language competence/ability/proficiency) has been used in many different ways by various authors64 and that this term––especially in the English research literature––varies widely. Not surprisingly, the term proficiency can also be rendered differently in Chinese. Some authors translate proficiency as nénglì能力, while others use the term shuǐpíng水平 (Shèng Yán, 1990, p. 336; quoted by Zhāng Kǎi, [1994] 1997, p. 334). Zhāng Kǎi ([1995] 1997) underscores the problem of using shuǐpíng水平 because in language testing a single standard reference (biāozhǔn cānzhào标准参照) for “proficiency” does not exist (p. 41), and he defines what he understands by the traditional notion of language ability, which plays a very important role for the HSK:

From a traditional (point of) view, so-called language competence (ability/capacity) is the sum of language knowledge (phonetics, lexis, syntax, etc.) and language skills (listening, speaking, reading, writing, etc.). However, for the term proficiency one could also add the factor of “the degree of fluency.” (Zhāng Kǎi, 2006c, p. 6)

从传统的观点看,所谓的语言能力( ability/capacity )就是语言知识(语音、词汇、语法等)和语言技能(听、说、读、写等)的总和,而 proficiency 里可能又增加了流利程度这样一个因素。(Zhāng Kǎi, 2006c, p. 6)

The traditional view purely follows the skills and component approach. Although Chinese HSK experts used the “four skills” terminology, they were aware that these skills could not be completely isolated from each other (Liú Yīnglín, [1988a] 1994, p. 36; Wú Yǒngyì and Yú Zǐliàng, 1994, p. 69).65 In my opinion, adding “the degree of fluency” does not substantially change the meaning. Zhāng Kǎi also mentions Chomsky’s (1965) distinction between competence and performance, and he cites Hymes’ (1974) criticism of Chomsky. Hymes (1974) coined the term communicative competence (jiāojì nénglì交际能力). Furthermore, Zhāng Kǎi mentions some of the most important models of communicative competence (e.g., Canale and Swain, 1980; Verhoeven and Vermeer, 1992) and acknowledges that Bachman’s model is the most influential.66 In this context, Zhāng Kǎi describes the core issue of language testing:

Although a great deal of research on language ability and communicative ability has emerged in China and abroad, people’s understanding of the concept of competence and of the closely related concept of performance differs substantially… When this problem emerges in language testing, it becomes the issue of validity … Language ability is a latent ability, and it cannot be observed per se. The only ← 34 | 35 → thing we can observe is (part of) its performance. If one wants to know whether a test assesses this latent ability, this is a question of construct validity. (Zhāng Kǎi, 2006c, p. 7; italics added)

虽然国内国外对语言能力和交际能力的研究大量出现,但是人们对 competence 以及与之密切相关的 performance 概念的理解很不相同。(略)当这个问题出现在语言测试里时,它就变成了效度问题了。(略)语言能力是一种潜在的能力,它本身是观察不到的,我们能够观察到的只是它的(部分)表现。要想知道一个测验是否测到了这种潜在能力,这就是构想效度问题。

Here, two points are crucial (cf. sections 2.1 and 2.2): the concept of language ability is constructed, and this ability can only be observed indirectly because it is an underlying ability or trait. Therefore, construct validity has to be investigated if one wants to know whether a language test measures this latent ability, and what we measure is the test takers’ performance.67 The use of the labels “communicative” competence or “communicative” language ability per se does not provide additional insights. We still have to construct what we want to measure. This aspect will be discussed in sections 3.3 and 3.4, and then applied to the HSK in chapter 4. Accordingly, Chinese authors describe the construct as a so-called “black box” (hēi xiāng黑箱; Zhāng Kǎi [1995] 1997, p. 42), and Zhāng Kǎi ([1995] 1997) emphasizes (cf. Bachman, 2007):

It does not matter from which definition one starts: the language testing construct is still not clear, even today. (Zhāng Kǎi, [1995] 1997, p. 42)


So, what does the HSK aim to test? Zhào Jīnmíng (2006) explains that the purpose of the HSK is to assess test takers’ Chinese proficiency (Hànyǔ shuǐpíng汉语水平). In particular, it should measure the Chinese language ability needed for studying in China (cf. chapter 4). Zhào Jīnmíng calls the HSK a zhǔgàn kǎoshì主干考试 (“mainstay” examination), which addresses the core of Chinese proficiency. In the HSK concept, the mainstay HSK is accompanied by four “branch” examinations (fēnzhī kǎoshì分支考试), which were designed for the use of Chinese for other purposes (ibid., pp. 23–24).68 This reveals that the connection of language to context was well known among HSK developers and researchers. Otherwise, if there were just one single “general language proficiency,” why would tests for other ← 35 | 36 → purposes be developed? On the other hand, the HSK concept also reflects the belief that there is an overall language proficiency core (yìbān yǔyán nénglì一般语言能力, Wáng Jímín [2002] 2006, p. 53; zhěngtǐ yǔyán shuǐpíng整体语言水平, Wú Yǒngyì and Yú Zǐliàng, [1993] 1994, p. 338), bound to the concept of academic Chinese language. As a mainstay examination, the old HSK was intended to measure Chinese in a broader variety of contexts.69

Wáng Jímín ([2002] 2006) wrote an article about language competence research outside of China, in which she transferred this knowledge into Chinese research and presented a detailed overview of the historical developments in the theory of language competence. She identified three concepts: the aforementioned skills and components model (jìnéng/chéngfèn móxíng技能/成分模型), a period of unified approaches (yīyuánhuà jiēduàn一元化阶段), and models that added communicative competence. She claims that the huge advantage of discrete-point tests (fēnlìshì cèyàn分立式测验) is their objectivity (ibid., pp. 48–49), overlooking that multiple-choice tests are not necessarily discrete-point tests.70 She cites Bachman’s (1990) critique of the ability model, which he said failed to sufficiently recognize the context of language use, and she meticulously depicts Oller’s pragmatic language testing approach (1983), emphasizing that he underscored the use of context in language testing, although Oller’s unitary competence hypothesis finally proved to be incorrect (Bachman, 2007; Oller, 1983; Wáng Jímín, [2002] 2006, p. 55).71 Finally, Wáng Jímín stresses the importance of models of communicative testing (Bachman, 1990; Canale and Swain, 1980), and she cites Skehan (1991), who evaluated Bachman’s communicative language ability approach (CLA; 1990) as a milestone for language testing. One of the major achievements of this model was that Bachman expanded the notion of strategic competence72 from a pure compensation strategy to one that underlies all language use (Bachman, 2007, p. 54).
Bachman’s model of communicative language ability, together with his and Palmer’s (1996) concept of usefulness, has been transferred ← 36 | 37 → into Chinese research by various authors.73 Wáng Jímín’s final appraisal appears superficial, merely saying that (a) discrete-point testing still has its place in language testing today74, and (b) Oller’s work stimulated new research about the nature of language competence. In addition, Wáng Jímín ([2002] 2006) praises the practical worth of Bachman and Palmer’s CLA model, which links perspectives of linguistics, sociolinguistics, and psychology (p. 62), and she states that the CLA model was one important theoretical basis for further developing the new TOEFL; however, her findings are not related to the construct of the HSK.75

Judging from the Chinese research, the decisive point in language testing is the definition of the construct. Chén Hóng ([1997b] 2006) highlights the pivotal issue of how to specify the nature of the trait, which has to be defined by a construct:

In the fields of psychological and educational measurement, researchers … estimate the extent to which a test taker possesses or performs some sort of psychological trait. In all research dealing with validity, we face a fundamental problem, namely: What is the nature of this psychological trait? … In language proficiency testing, the psychological trait that has to be estimated usually refers to some specific language ability, which is why the construct can be understood as the definition of this kind of language competence.

在心理和教育测量领域,研究人员(略)估计被试具备或表现出多少某种心理特质( trait )。在一切有关效度的研究中,我们面临的一个基本的问题就是,这种心理特质的性质是什么?(略)在语言能力测验中,被估计的心理特质一般指某种语言能力的定义。(Chén Hóng, [1997b] 2006, pp. 200–201)

He emphasizes the fact that the construct is “purely theoretical” (chún lǐlùnde纯理论的); therefore, we have to state hypotheses or assumptions (jiǎshè假设) about it. The theoretical tie between the ability and the observed performance (guānchá de biǎoxiàn观察的表现) enables us to make inferences (zuòchū tuīduàn作出推断) about test takers’ language competences (ibid.; Zhāng Kǎi, 2006c, p. 7).

Chén Hóng ([1997b] 2006) realizes that the structuralist conceptions of Lado (1961) and Carroll (1961) still influence language testing, as can be seen in the (paper-based) TOEFL, in which “the format, score composing, score interpretations and the methods in which items are designed––all had left scars of structuralism” (p. 212). Furthermore, he criticizes the HSK for being theoretically based on models of communicative competence while in practice still following Lado’s old definition of language competence (pp. 212–213). In his criticism, he ← 37 | 38 → also points to the lack of authenticity76 (p. 219) in language assessment in general, and decries the widespread use of discrete-point tests in high-stakes language testing (p. 222).

In another article about validation and construct validity, Chén Hóng ([1999] 2006) refers to the fundamental question in language testing—the relation between ability and language behavior (context; cf. section 2.2):

However, language test developers initially have to face the following questions: What is the nature of language ability? Which aspects does language ability include? … In addition, how are language ability and language behavior related to each other? (ibid., p. 248; italics added)


Chén Hóng adds a new point to the Chinese discussion, stressing correctly that how we perceive language ability is mainly based on theoretical assumptions, which to a certain degree have to involve subjectivity ([1999] 2006, p. 249). This is one of the rare moments in Chinese CFL literature when an author explicitly alludes to the limitations of language proficiency tests.77 Moreover, this unmistakably reveals that language assessment is bound to values, which are crucial for modern concepts of validity and validation (e.g., Messick, 1989b). Such concepts do not merely include values; they explicitly claim to integrate value assumptions by identifying where and when they influence testing. Therefore, values are also an integral part of the validation concept of this work (cf. sections 3.3 and 3.4, and chapters 4 and 5). Furthermore, Chén Hóng ([1999] 2006) discusses the role of the ability/trait approach, criticizing Lado for not resolving the issue of how language ability is related to language performance. According to Chén Hóng, a persisting core problem in language testing is how to operationalize the relation between ability and context (ibid., p. 250). While noting that Lado’s influence on large-scale language tests is still very much alive (p. 252), Chén Hóng turns to the construction and conception of language proficiency on which the HSK is based. He indicates that the HSK originally tried to focus on the concept of communicative competence:

In the early developmental experiment period of the HSK, [the HSK] already included concepts of communicative ability, and defined it as the utilized/applied ability of Chinese language within specific social and cultural contexts. (Chén Hóng, [1999] 2006, p. 252)

HSK 在其早期研制试验阶段,已经引入了交际能力的概念并将其定义为在一定的社会文化情境中对汉语的运用能力。

However, in the official HSK documents (e.g., Liú Xún et al., [1986] 2006) the construct merely comprised two aspects, namely the four skills and language ← 38 | 39 → knowledge (phonetics, lexicon, etc.). Chén Hóng argues that this perception of language proficiency still dominated in 1999, and he concludes that the HSK mainly followed Lado’s skills and components approach, although it actually should not have done so. Chén Hóng’s findings were also supported by a study undertaken by Guō Shùjūn ([1995] 2006), which revealed that the HSK—at least in part, and especially in the grammar section—still did not reach its original aim to measure communicative competence.78 Finally, Chén Hóng concluded that the “HSK in theory is still not conscious and ripe enough in regard to the problem of the construct of language competence” ([1999] 2006, p. 253).79 However, he noted that this problem did not only occur in the HSK; it also affected language proficiency tests for other foreign languages that mainly use multiple-choice test items.

With regard to how HSK test developers envisioned the construct of Chinese language proficiency, a detailed look at the early documents that underlay the construction and were published during the initial development stage of the HSK can provide informative insights. For instance, Liú Xún et al. ([1986] 2006) stress that communicative language use includes more than the pure form of the language. One also has to take situational factors into account, which are largely influenced by society, and Liú Xún et al. already mention the use of communicative strategies (huìhuà cèlüè会话策略, p. 11; cf. Bachman, 1990). In their essay, we can find the primary goals of the HSK, and these involve important statements revealing how the construct was shaped on the theoretical level:

The form of the language and the social functions of the language have to be organically united into didactics. To correctly treat the relation between language ability and communicative language competence to achieve the final goal of relatively comprehensively fostering communicative language ability––this is the foundation of the HSK design. (Liú Xún et al., [1986] 2006, p. 12; italics added)


According to Liú Xún et al., the form of the language (components) needs to be connected with context (the social use of language), and the overall goal is for learners to develop communicative language ability. Liú Yīnglín ([1988a] 1994, p. 35), who says that language consists of “form” (xíng) and “meaning” or “intention,” also supports this teaching goal. The form refers to the structure of the language (yǔyán jiégòu语言结构), and the meaning to the functions (gōngnéng功能) and the cultural background of the language (wénhuà bèijǐng文化背景). ← 39 | 40 →

[T]he main focus of the HSK should be to examine communicative competence … When entering university and studying in a department, the important channels for achieving knowledge are receptive listening and reading abilities … If [they] do not possess a certain listening comprehension ability, [they] cannot listen and understand classes. If [they] do not possess certain reading comprehension ability, [they] cannot read. At the same time, in addition to assessing listening and reading comprehension abilities, oral and written productive abilities must be properly tested because they [the students] live in China, and they have to engage in normal social interaction, cope with daily oral conversation, and raise and answer questions. Furthermore, they have to be able to take notes, do homework and write experimental reports, and common letters/written messages, notes, and so forth. Therefore, the stress in testing exactly focuses on listening, speaking, reading, writing, and other aspects of communicative competence. (Liú Xún et al., [1986] 2006, p. 13)


On the one hand, this statement shows that the focus of language use was on receptive skills. Indeed, receptive skills are certainly required basics if someone wants to study a subject at a Chinese university successfully. However, the productive use of the language is also mentioned; therefore, productive skills should have been assessed (to a certain extent) as well.80

In addition, … in connection with the actual proficiency of the majority of test takers, they are still far from a relatively high level. In proficiency testing, a specific amount of basic vocabulary, commonly used sentence structures, Chinese characters, and other linguistic components are included for understanding the extent of mastery of their [the test takers’] linguistic basic knowledge. This is also absolutely necessary. (Liú Xún et al., [1986] 2006, pp. 13–14)


In my opinion, this last statement is very important for CFL proficiency testing and for understanding how the old HSK was modeled at its core. HSK designers argued that many test takers did not have a very high level in Chinese proficiency. In turn, this means that learners often just use pieces or fragments of the Chinese language, far from fluent, competent use. Therefore, one aim was to assess the amount and mastery of these language components. ← 40 | 41 →

In 1990, Liú Yīnglín confirmed that the pure form of the language is not enough for understanding its nature, and he stated that in CFL teaching, the construct of language81 was connected to the function of the language (yǔyán gōngnéng语言功能) by stressing that cultural aspects should be included (Liú Yīnglín, [1990c] 1994, p. 7). In addition, he mentioned the connection between the construct of the language and CFL didactics, which he called jiàoxuéfǎ tǐxì教学法体系 (ibid.). This is important because modern Western conceptions of language proficiency tests usually do not take didactics into consideration (Bachman, 1990; Bachman and Palmer, 2010). Thus, two aspects were important for the language construct of the HSK: basic language knowledge (yǔyán jīchǔ zhīshi语言基础知识) and communicative language competence (yǔyán jiāojì nénglì语言交际能力), or simply language competence (yǔyán nénglì语言能力).82 As shown later in sections 4.1 (Target Domain) and 4.5 (Explanation), this concept also influenced the design of the test sheet (juànmiàn gòuchéng卷面构成) and the weighting of the four subtests of the Elementary-Intermediate HSK, as well as what they were intended to measure (Zhāng Kǎi, [1994] 2006, p. 198).

This section shows that Chinese CFL scholars discussed the theoretical issues fundamental to language testing, and it foreshadows how these considerations also influenced specific parts of the HSK design, as several critiques of the HSK’s construct validity indicate. The construct was considered fundamental, and Chinese researchers were aware that language tests are value laden, which means that from an epistemological viewpoint, language tests cannot be “objective83.” Moreover, the theoretical assumptions underlying test design inevitably lead to a certain degree of subjectivity and to limitations in the use of language tests.84

2.4 Current views of language proficiency

The three different viewpoints mentioned by Bachman (2007) in section 2.2 are crucial for defining the construct we intend to measure, and they simultaneously influence our research questions, how we design empirical investigations, and how we interpret and use assessment results (Bachman, 2007, p. 41). Bachman (1990) phrased this core issue in the following statement: ← 41 | 42 →

[L]anguage is both the object and instrument of our measurement. That is, we must use language itself as a measurement instrument in order to observe its use to measure language ability … this makes it extremely difficult to distinguish the language abilities we want to measure from the method factors used to elicit the language. (pp. 287–288)

Skehan (1998, p. 155) sees the crux in the “abilities/performance/context conflict,” which was also addressed in CFL research, for instance by Chén Hóng ([1999] 2006, p. 248). If the researcher focuses too much on abilities, he will neglect performance and context. Sharpening the focus on performance or on context will weaken the two remaining elements. In language proficiency tests, we are interested in a wide array of contexts. Hence, the crucial question is whether ability or context is mainly responsible for the performance of test takers.

Bachman (2007) distinguishes seven different approaches to the construct of language proficiency85 developed within the last half century. The most important finding, however, is that he classifies every approach as focusing mainly either on the ability/trait or on the task/context, except for a variation of the seventh approach86, which tries to concentrate mainly on the interaction between ability and context. Bachman (2007) concludes that “from the early 1960s, we can see a dialectic between a focus on language ability as the construct of interest and a focus on task or context as the construct” (p. 43). Chalhoub-Deville’s moderate interactionalist approach, called “ability-in-individual-in-context,” seems to rely equally on ability/trait and task/context because it posits an interaction between them.87 However, it does not help much in concretely operationalizing the construct for language tests. ← 42 | 43 →

According to Bachman, relating a specific approach more to the ability or to the task, or focusing on the interaction between both, is purely a question of values and assumptions. Therefore, he concludes that “the conundrum of ability and context and how they interact is, in my view, essentially a straw issue, and may not be resolvable at that level” (Bachman, 2007, p. 70). In his final appraisal of this issue, he states that these three “different” approaches are not mutually exclusive (ibid., p. 41); on the contrary, they should all be considered in “design, development, and use of language assessments” (ibid., p. 71).

2.5 Approach for this work

A single best way to define the construct for language proficiency does not exist (e.g., Bachman, 1990, 2005, 2007; Bachman and Palmer, 1996, 2010; Canale and Swain, 1980; Chapelle, 1998; Chalhoub-Deville, 1997, 2001; Chén Hóng, [1997b] 2006, [1999] 2006; Grotjahn, 2003; Liú Xún et al., [1986] 2006; Liú Yīnglín, [1990c] 1994; McNamara, 1996; Oller, 1979; Wáng Jímín, [2002] 2006; Zhāng Kǎi, 2006c)88, and many experts agree that using language for a specific purpose has to be integrated into the construct, which means that linguistic knowledge has to be combined with strategies that help language users achieve their communicative goals in specific situations (Bachman, 1990, 2007; Bachman and Palmer, 2010). Chapelle et al. call this “the ability to use a complex of knowledge and processes to achieve particular goals” (Chapelle et al., 2008, p. 2). Thus, language knowledge is not irrelevant, but it is not sufficient. Rather, it has to be embedded into the context of language use, which affects the nature of language ability. Grotjahn (2003) underscores that communicative models of language proficiency do not merely consist of declarative knowledge; they also comprise procedural knowledge and the ability to use language automatically (automatisiert) as a skill (pp. 8 and 11).

Although it is not possible to grasp the construct concretely, the Standards for educational and psychological testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1985, 1999), or simply Standards, place the construct at the heart of validity and validation (although in the Standards the term construct is not used synonymously with latent trait). They use the term to refer to “the concept or characteristic that a test is designed to measure” (1999, p. 5), and they support this definition of the construct for psychological and educational testing. This approach was widely accepted in the early 1990s in educational measurement. For example, Messick (1994), amongst others, stated that a theoretical construct should serve as the basis for score interpretation in large-scale, high-stakes testing. Therefore, the idea that a construct for language proficiency should build the foundation for test development and should simultaneously provide a framework for validation was widespread among language assessment experts as well (e.g., Alderson, Clapham, ← 43 | 44 → and Wall, 1995; Chén Hóng, [1997b] 2006, p. 201). Bachman and Palmer (1996) included construct validity in their model of usefulness, where it is the second component (p. 18) and which is also designed for test validation;89 in addition, they placed it at the fourth position in the test development process (pp. 115–132). Chapelle et al. (2010) described this dilemma in the following statement:

Despite agreement on the need to define the construct as a basis of test development, no agreement exists concerning a single best way to define constructs of language proficiency to serve as a defensible basis for score interpretation (e.g., Bachman, 1990; Bachman & Palmer, 1996; Chapelle, 1998; Chalhoub-Deville, 1997, 2001; Oller, 1979; McNamara, 1996). Nevertheless, most would agree that limiting a construct of language proficiency to a trait such as knowledge of vocabulary or listening is too narrow for the interpretations that test users want to make for university admission decisions. Instead, test users are typically interested in examinees’ ability to use a complex of knowledge and processes to achieve particular goals. Therefore, strategies or processes of language use have been included in constructs of language proficiency, called communicative competence (Canale & Swain, 1980) or communicative language ability (Bachman, 1990). (ibid., p. 4)

Language proficiency has to be specified by the context because

[A] conceptualization of language proficiency that recognizes one trait (or even a complex of abilities) as responsible for performance across all contexts fails to account for the variation in performance observed across these different contexts of language use (Bachman & Palmer, 1996; Chalhoub-Deville, 1997; Chapelle, 1998; McNamara, 1996; Norris, Brown, Hudson, & Bonk, 2002; Skehan, 1998). As a result, language proficiency constructs of interest are difficult to define in a precise way … (Chapelle et al., 2010, p. 4)

The specific linguistic knowledge and the strategies required to accomplish goals depend on the context in which language performance takes place (Chapelle et al., 2008, p. 2) because we can find variation in performance according to the specific context. On the one hand, we want to include the context because it is a distinct feature of language use; on the other hand, we are interested in predicting performance in many contexts. This results in the construct becoming too complex because it varies too much from context to context or, in Messick’s words, because of a considerably strong “variation in the range of settings and circumstances” (Messick, 1989b, p. 15).90 For the test validation process, one possible solution that has been adopted in the past years is the argument-based approach to validity, which is based on an interpretive argument. The construct still plays an important role, but it is, in itself, not the framework for the validation process. The argument-based approach will be laid out in detail in section 3.4. Here, only an overview will be provided. The structure of an interpretive argument is rather simple: it is based ← 44 | 45 → on grounds91 that lead to a claim. In other words, we are observing the behavior of a person, namely the behavioral consistencies of a test taker. Then, we draw inferences that result in a conclusion or claim about the behavior of the person. The inference has to be justified by a warrant, which is a general statement that provides legitimacy for a particular step in the argument (Toulmin, 2003, p. 92). The warrant itself is based on a backing, which in language testing generally comes from “a theory, prior research or experience, or evidence collected specifically as part of the validation process” (Bachman, 2005, p. 10). In principle, the inference depends on assumptions, and these assumptions have to be justified. The counterpart to the warrant is a rebuttal, which tries to weaken the inference.
Rebuttals are alternative explanations, or counterclaims (Bachman, 2005, p. 10). This approach is enormously useful because all test-score interpretations involve an interpretive argument, starting from the score and ending with decisions or conclusions. When we validate a test-score interpretation, we “support the plausibility of the corresponding interpretive argument with appropriate evidence” (Kane, 1992, p. 527). The mere observation of the student’s performance is not enough for making a claim. We need an interpretive argument that specifies the interpretation drawn from the grounds to a claim by an inference (Chapelle et al., 2010, p. 5). Test developers and researchers have to identify the specific inferences upon which the score interpretation is based. This identification finally leads to an inferential chain, and the purpose of the validity argument is the evaluation of this interpretive argument (Chapelle et al., 2010, p. 5). The major advantage of presenting an interpretive argument as a framework for validation can be summarized as follows:

This … illustrates the basic approach to an interpretive argument that states the basis of an interpretation without defining a construct. Rather than terms referring to examinee characteristics, the tools of the interpretive argument are inferences that are typically made in the process of measurement. … If test developers can work with a set of such inferences rather than solely with the complex constructs that can be defined in many different ways, the basis for score interpretation becomes more manageable. (Chapelle et al., 2010, p. 5)

This approach does not solve all issues, and it does not obviate the need to include investigations of construct validity in the whole validation process. But the validation process itself no longer rests preferentially on the construct; the construct has in turn become a part of the inferential chain. That is why an argument-based approach offers the possibility to integrate a huge amount of diverse validity evidence collected by researchers; at the same time, one can arrange this evidence within a logical structure that supports the interpretation of test scores with regard to the specific test use.
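The Toulmin-style structure of an interpretive argument (grounds, inferences justified by warrants and backings, possible rebuttals, and a final claim) can be sketched schematically. The following Python sketch is purely illustrative and not part of the validation literature; all class names, fields, and the example chain (the score, warrants, and rebuttal texts) are hypothetical labels of my own:

```python
from dataclasses import dataclass, field

@dataclass
class Inference:
    """One step in an interpretive argument (cf. Toulmin, 2003)."""
    warrant: str                  # general statement legitimizing this step
    backing: str                  # theory, prior research, or evidence supporting the warrant
    rebuttals: list = field(default_factory=list)  # alternative explanations weakening the step

@dataclass
class InterpretiveArgument:
    grounds: str                  # observed behavior, e.g., a test taker's performance
    inferences: list              # the inferential chain from grounds to claim
    claim: str                    # conclusion or decision about the test taker

    def unchallenged(self) -> bool:
        """True only if no rebuttal has been raised against any inference."""
        return all(not inf.rebuttals for inf in self.inferences)

# Hypothetical example: a minimal chain from an observed score to a decision.
argument = InterpretiveArgument(
    grounds="Observed listening score of 85 on a proficiency test",
    inferences=[
        Inference(
            warrant="Scores generalize across comparable tasks and occasions",
            backing="Reliability/generalizability evidence",
        ),
        Inference(
            warrant="Scores reflect the ability needed for academic study in Chinese",
            backing="Construct theory and prior research",
            rebuttals=["Test-method factors, not ability, explain the performance"],
        ),
    ],
    claim="The test taker can follow university lectures in Chinese",
)

# A rebuttal exists, so the claim still needs further support.
print(argument.unchallenged())
```

The point of the sketch is only that validation evaluates the chain as a whole: each inference must be warranted and backed, and any rebuttal left standing weakens every conclusion downstream of it.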

36 Jackson and Kaplan (2001) say that staff members of the Foreign Service Institute (FSI) of the U.S. Department of State first used the term language proficiency in the late 1950s (p. 72). In the FSI, it is understood as “the ability to use language as a tool to get things done” (Jackson and Kaplan, 2001, p. 72).

37 Vollmer (1981) says that “[l]anguage proficiency is what language tests measure” (p. 152).

38 The labeling of a test or of a trait which the test is intended to assess is often associated with specific preconceived notions within a society. Many trait labels had been long in use before anyone decided to measure them (Bruner, 1990). Language proficiency, language competence or language ability are all excellent examples for this phenomenon. As Cook and Campbell (1979) underscored, test developers “like to give generalized abstract names to variables” (p. 38), and “trait labels may make implicit claims that the trait can be interpreted more broadly” (Kane, 2006, p. 32). Hence, the labeling of a trait involves values and assumptions about the trait (ibid.).

39 There is no clear distinction between these terms. The term “test” refers to an instrument, while the terms “testing” or “assessment” are more appropriate for the activity (cf. AERA, APA, and NCME, 1999, p. 3). According to Bachman and Palmer (2010), some authors understand “tests” as formal, and “assessments” as informal, without any further specification. Others distinguish between “plain tests” and “alternative assessments,” or “performance assessments.” The latter are believed to be more “authentic” or “real-life-like” than tests (p. 19), but terms such as “authentic” or “real-life-like” are highly debatable in themselves, mainly because they are value laden.

40 Bachman (1990, 2004b) and Bachman and Palmer (2010) distinguish between evaluation and assessment. For Bachman, evaluation takes place when the moment of making a decision is part of (or follows) the assessment (Bachman, 1990, p. 22). Bachman and Palmer argue that evaluation involves making value judgments, and it has something to do with the test purpose (Bachman and Palmer, 2010, p. 21); Nitko proposes the same argument (1983, p. 7). In my view, this distinction is possible, but unnecessary because (a) value judgments are part of the whole assessment process, and (b) the term evaluation in such a limited sense might be misleading. For the step of making a decision I simply prefer the terms “decision” or “decision-making” (cf. Kane, 2006).

41 Many authors have stressed quantifying behavior when measuring a person’s psychological attributes, e.g., Crocker and Algina (1986, p. 5). According to Carroll (1968) a test is a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual (Carroll, 1968, p. 46; quoted in Bachman, 1990, p. 20).

42 Ability in this context is commonly translated as nénglì能力 into Chinese.

43 Taylor (1988) says “the ability to make use of competence” (p. 166).

44 Many applied linguists view language ability as consisting of two components––language knowledge, sometimes called “competence,” and cognitive processes or “strategies” (Bachman and Palmer, 2010, p. 57). Here, Bachman and Palmer seem to mix up the two components language ability consists of, compared with their former statement.

45 This is a crucial point in designing language proficiency tests and in interpreting scores of such tests. The question is to what extent scores of a proficiency test can be used to draw inferences that facilitate predictions about the future behavior or performance of the test taker. Test scores are always based on a sample of situations, and we want to know how far we are able to generalize to other contexts. This issue makes the term so problematic. Morrow (1979/1981), for instance, says that language proficiency describes “how successful the candidate is likely to be as a user of the language in some general sense” (p. 18). Such vagueness is also expressed by Ingram (1985), who similarly notes that “what is meant when we say that someone is proficient in a language is that a person can do certain things in that language” (p. 220). This key issue is one important point in the construction of the HSK.

46 The TestDaF was used for the first time in 2001. It assesses the German language ability of foreigners who want to apply to a German university (www.testdaf.de).

47 According to Cummins (1983), second language learners can be proficient in some contexts but they lack proficiency in other contexts. Bachman (2004) admits that the term language proficiency has “a common core of general meaning for most people” (p. 14).

48 The notion of some sort of language proficiency across all contexts is expressed by the terms “general language ability” (Lowe, 1988, p. 12) or “overall language proficiency” (Spolsky et al., 1968). This concept is also part of the ACTFL Proficiency Guidelines (American Council on the Teaching of Foreign Languages, 1986). Spolsky et al. (1968) said: “What does it mean to know a language? … We are using the overall approach when we attempt to give a subjective evaluation of the proficiency of a foreign speaker of our language. In such cases, we usually do not refer to specific strengths and weaknesses, but rather to his overall ability to function in a given situation… Overall language proficiency is more usefully regarded as the ability to function in natural language situations. We do not say that someone knows a language when he can produce one or two sounds, or repeat one or two sentences, or translate one or two words, but when he is able to communicate with others and understand what he hears or reads” (pp. 79–80).

49 The debate on the existence of “general language proficiency” led to great confusion in terminology. McNamara (1996, p. 51) put it aptly: “Other instances of statements, claims and definitions which can only result in confusion for the reader are not hard to find. For example, according to Richards (1985: 4), ‘Language proficiency is hence not a global factor.’ But according to Alderson et al. (1987: iv), ‘Proficiency is a global construct.’”

50 There is a German and an English scientific discourse on the term competence.

51 Language ability or competence are translated into Chinese as yǔyán nénglì语言能力.

52 These two terms will be referred to in more detail in chapter 3.

53 (略)在语言测验中,效度问题和能力问题是一个问题的两面。[(Omission) In language testing, the question of validity and the question of ability are two sides of the same problem.]

54 A latent trait in Chinese terminology is called qiánzài de tèzhì潜在的特质 or sometimes qiánzài de nénglì潜在的能力. Crocker and Algina (1986) state: “[P]sychological attributes cannot be measured directly; they are constructs” (p. 4).

55 In the Chinese literature, these sides are called nénglì能力 (for ability) and biǎoxiàn表现 (for task/performance); the latter is often associated with the term xíngwéi行为 (behavior).

56 The term trait can be equated with ability here.

57 Skehan (1998) uses the terms construct-based (for ability-focused) and task-based (for task-focused).

58 Messick (1996) calls the term direct assessment a “misnomer because it always promises too much,” and he cites Guilford who says “all measurements are indirect in one sense or another” (Guilford, 1936, p. 5; qtd. in Messick, 1996, p. 244).

59 McNamara (1996) distinguishes between performance tests in a strong and in a weak sense. The latter tests do not belong to this group. The term performance assessment is problematic because in any test we assess the performance of test candidates, even if that performance consists of filling out a paper-and-pencil multiple-choice test, which bears little resemblance to real-life language use. Fitzpatrick and Morrison (1971) suggest: “There is no absolute distinction between performance tests and other classes of tests—the performance test is one that is relatively realistic” (p. 238).

60 The term “authentic” in language assessment is also very problematic because as Spolsky already declared, “[a]ny language test is by its very nature inauthentic” (Spolsky, 1985, p. 39), which means that a specific test situation––irrespective of how well it resembles the target domain the test tries to replicate––will always have distinct characteristics of a test situation, which could or will possibly evoke (to a certain extent) “unnatural” behavior. The term authenticity is discussed in more depth in section 3.2.6.

61 Discrete-point tests (Chinese: fēnlìshì kǎoshì分立式考试) target specific, single, and isolated linguistic features (Grotjahn, 2003, p. 37; Wáng Jímín, [2002] 2006, p. 49).

62 Proficiency test(ing) can be translated into Chinese as nénglì cèyàn能力测验 or shuǐpíng cèshì水平测试 (cf. Shèng Yán, 1990; Sūn Déjīn, 2007, p. 129).

63 Zhòng shuō fēn yún众说纷纭 (“everyone speaks diversely and confused”; ibid., p. 5).

64 Zhāng Kǎi (2006c) points out that the terms “ability,” “capacity,” “skill,” and “proficiency” have all been used for the notion of language competence (ibid.).

65 (略)每一种技能都不完全是单纯的、独立的,而是一种复合体——复合技能。”[Each skill is not absolutely pure and independent; together, the skills are a compound system—a composite.]

66 Bachman’s influence has been thoroughly discussed among Chinese researchers, for instance by Lǚ Bìsōng (2005) and Fàn Kāitài (1992).

67 Zhāng Kǎi ([2005b] 2006) devoted an article to the performance issue, in which he underscores that this term is used inconsistently by different authors, for instance by Chomsky and Hymes. He concludes that performance should be translated with different Chinese terms, reflecting these varying concepts, and suggests translating Chomsky’s conception with biǎoxiàn表现 (“to show, to display”), and Hymes’ term with yùnyòng运用 (“to utilize, to apply”).

68 In 2006, some of these tests were still under development. Zhào Jīnmíng said the Hanban was developing a Chinese test for juveniles (HSK shào’ér 少儿), a business test (HSK shāngwù 商务), a test for secretaries (HSK wénmì 文秘), and a test for travelling (HSK lǚyóu 旅游) (Zhào Jīnmíng, 2006, p. 23).

69 In addition, Zhào Jīnmíng (2006) notes that the HSK was undergoing reform at the time, aiming at a broader use of the test, which should encourage more learners of Chinese to take it and ultimately result in more people learning Chinese, in line with the requirements of promoting Chinese internationally (p. 24). Here we find some indication of the motives behind the changes brought by the new HSK in 2010.

70 Zhào Jīnmíng (2006) mixes up discrete-point tests (which can have the format of multiple-choice tests) and multiple-choice tests. One feature of the latter is time-efficient scoring, which can be carried out by machines; it is therefore multiple-choice tests, and not necessarily discrete-point tests, that are extremely helpful for large-scale tests (dàguīmò kǎoshì大规模考试).

71 Oller stated the hypothesis that language proficiency is essentially a single unitary ability, rather than separate skills and components, and he believed that he had identified this general factor in his empirical research, which he called “pragmatic expectancy grammar” (Bachman, 2007, p. 48).

72 Strategic competence is translated with cèlüè nénglì策略能力.

73 Amongst others, Hú Zhuànglín ([1996] 2006) provides a detailed overview.

74 This is mainly true with regard to the old HSK (especially for the Basic and the Elementary-Intermediate HSK), and partly to the new HSK. 90% of the Elementary-Intermediate HSK items (154 out of 170 items) were designed in the multiple-choice format, one possible feature of discrete-point tests.

75 In the abstract of her cited essay, she does not say how all these concepts have influenced or contributed to the construct of language proficiency or to the old HSK test design, nor how these models could contribute to modeling constructs for Chinese proficiency tests.

76 The term authenticity in Chinese is zhēnshíxìng真实性.

77 See also Liú Yīnglín (1994d, pp. 204–206).

78 A more detailed description of how the language construct of the HSK was investigated is given in section 4.5 (Explanation).

79 “HSK 对于语言能力结构问题在理论上还不够自觉和成熟。” [With regard to the question of the structure of language ability, the HSK is theoretically not yet sufficiently deliberate and mature.]

80 This target language domain description slightly resembles the CEFR self-assessment grid with its “can-do statements.” The target language domain will be described in more detail in section 4.1.

81 Liú Yīnglín used the terms “theoretical validity” (lǐlùn xiàodù理论效度) and “conceptual validity” (guānniàn xiàodù观念效度) for construct validity (ibid., [1990c] 1994, p. 7).

82 Basic language knowledge was originally called “language structure” or “language construct,” in Chinese yǔyán jiégòu语言结构 (Zhāng Kǎi, [1994] 2006, p. 198).

83 To what degree a test can be considered objective is a question of how the term objective or objectivity is understood. For example, regarding the scoring aspect of a test, a test can indeed be considered to have a high objectivity of scoring or objectivity of administration (cf. section 3.2.1, p. 63).

84 This is a very important finding because superficially many test takers believed and still believe that the old (and new) HSK assessed Chinese language proficiency objectively.

85 He identifies (1) the skills and elements approach (e.g., Lado, 1961; Carroll, 1961, 1968), (2) the direct testing/performance testing approach (e.g., Clark, 1972), (3) the pragmatic language testing approach (e.g., Oller, 1979), (4) the communicative language testing approach (e.g., Canale and Swain, 1980), (5) the interaction-ability approach (e.g., Bachman and Palmer, 1996), (6a) the task-based performance assessment 1 approach (e.g., McNamara, 1996), (6b) the task-based performance assessment 2 approach (e.g., Norris et al., 1998), and (7) interactionalist approaches (a minimalist one, e.g., Kramsch, 1986; a strong one, He and Young, 1998; and a moderate one, Chalhoub-Deville, 2003).

86 Approach no. seven in his investigation is the so-called “moderate interactionalist approach” by Chalhoub-Deville (2003). It has only an indirect connection to social interaction.

87 Chalhoub-Deville (2003) raises the question whether ability lies solely inside the language user, as a trait that belongs to the language user. Her opinion is that ability is co-constructed in a dynamic discourse, together with other language users (p. 372). If ability were foremost co-constructed together with other language users, it would be problematic or even impossible to draw inferences from performance, and it would ultimately be unfeasible to generalize how a language user would perform in other contexts, which would make language testing useless. However, the degree of co-construction depends largely on the specific activity of language use. Chalhoub-Deville focuses primarily on spoken language. For receptive language use, in contrast, the factor of co-construction becomes less important. For example, listening to a radio program will show fewer features of co-construction than an interactive dialog.

88 Chapelle et al. (2008) call this undertaking a “divisive issue” (p. 1).

89 Weir (2005) proposes an alternative way of fitting the construct into the validation process.

90 Bachman (2005) says—with an ironic undertone—that the attribute (or trait/construct) “we intend to test is what we have all come to know and love as the construct” (ibid., p. 2).

91 The term “grounds” is used by Toulmin, Rieke and Janik (1984).

| 47 →

3 Test theory for language testing

When using a test we already apply science because when we measure we are modeling the world, and we make, explicitly or implicitly, theoretical assumptions about what we assess and how we assess. This is why validation––the justification that a measurement is valid––is “scientific inquiry” (Messick, 1989b, p. 14). But according to which criteria can we say that our assessment matches our purpose? How do we know that what we have measured is indeed the construct we intend to measure? To answer these questions, one needs to look closer at measurement theory and to state hypotheses about the measurement (section 3.1). In a second step, the relevant criteria for language tests will be laid out in theory and made explicit with examples from CFL proficiency testing (section 3.2). Then, the notion of the central quality criterion, validity, will be explained in detail (section 3.3) because this concept builds the foundation for the validation approach used in this dissertation, an argument-based validation approach (section 3.4). These questions will guide this chapter:

- What models of measurement theory exist in language testing and how are they applied in CFL proficiency testing and the HSK?

- What are the crucial test qualities in language testing and what role do they play for the HSK?

- What does a contemporary concept of validity look like, and how can such a concept be operationalized into a manageable validation procedure?

- And finally: how can such a concept be used for validating the HSK?

3.1 Classical test theory and item response theory

In testing we measure “something,” and in psychological testing this something refers to specific characteristics of a person. Bachman (1990) says that we are “quantifying the characteristics of persons according to explicit procedures and rules” (p. 18). Thus, besides a theory that tries to capture the nature of the characteristic we want to measure (in our case language ability), we also have to state theoretical assumptions about the nature and character of our measurement itself. In other words, we need a test or measurement theory (Grotjahn, 2003, p. 17). One approach is classical true score measurement theory, or classical test theory (CTT), which takes into consideration that every measurement is defective (Grotjahn, 2000, p. 306, 2003, p. 18). This means that an observed score comprises two components: a true score (the real ability of the test taker) and an error score, which is due to factors that influence the precision of our measurement (Bachman, 1990, p. 167). In CTT, the observed score is the sum of the true score and the error score. A second assumption in CTT is that error scores are unsystematic and uncorrelated with true scores (Bachman, 1990, p. 167), which means that over a series of measurements all errors should neutralize each other, so that the expected mean of all measurement error becomes zero (R. Baker, 1997). Besides these ← 47 | 48 → qualities, CTT has one major disadvantage: it is dependent on the group of examinees (Grotjahn, 2003, p. 18; Yen and Fitzpatrick, 2006, p. 111), which means that important statistical values will change depending on the composition of the group that takes the test. For example, for a multiple-choice item the item difficulty is calculated by counting how many test candidates chose the answer key; the percentage of all test takers who chose the key equals the item difficulty index.
For some groups of test takers a specific item might be very easy, whereas for other groups the same item might be more difficult.
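The group dependence of the CTT difficulty index can be made concrete in a few lines of code. This is a minimal sketch with invented response data; the helper `item_difficulty` is hypothetical and not part of any HSK analysis toolkit:

```python
# Item difficulty in classical test theory (CTT): the proportion of test
# takers who chose the answer key. All response data below are invented.

def item_difficulty(responses, key):
    """Return the share of candidates who chose the answer key (the p-value)."""
    return sum(1 for r in responses if r == key) / len(responses)

# Two hypothetical groups answering the same multiple-choice item (key "B"):
strong_group = ["B", "B", "A", "B", "B", "B", "C", "B", "B", "B"]
weak_group   = ["B", "A", "C", "B", "D", "A", "C", "B", "A", "C"]

print(item_difficulty(strong_group, "B"))  # 0.8 -> the item appears easy
print(item_difficulty(weak_group, "B"))    # 0.3 -> the same item appears hard
```

The same item thus receives two different difficulty indexes depending solely on who happened to take the test, which is exactly the group dependence described above.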

Item response theory (IRT), also called probabilistic test theory, tries to overcome this and other shortcomings of CTT. IRT estimates the probability that an individual test taker will solve a specific item. Thus, IRT attempts to predict the performance of a test taker based on his or her ability and the difficulty of the item. In addition, IRT places all items on a single difficulty scale, which is the same for all test takers (Bühner, 2006, p. 300). In contrast, CTT can only calculate indexes on the basis of the performance of a whole group; therefore, with CTT it is difficult to compare alternate test forms between two different groups. However, IRT is difficult to implement because (a) it needs large groups of test takers, (b) special software is required for analyzing test data, and (c) probabilistic test theory demands complex mathematical knowledge (Grotjahn, 2003, p. 18).
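An IRT model expresses the solution probability as a function of person ability and item difficulty on one common scale. A minimal sketch of the simplest such model, the one-parameter (Rasch) model, with invented ability and difficulty values:

```python
# Rasch (one-parameter logistic) model: the probability that a person with
# ability theta solves an item of difficulty b depends only on theta - b.
import math

def rasch_probability(theta, b):
    """P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals item difficulty, the solution probability is 0.5:
print(rasch_probability(0.0, 0.0))  # 0.5
# A more able candidate has a higher probability on the same item:
print(rasch_probability(1.5, 0.0))  # ~0.82
```

Because ability and difficulty live on the same scale, item difficulties estimated from one group can, within the model's assumptions, be applied to test takers from another group — the property CTT lacks.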

Regarding the HSK, Xiè Xiǎoqìng (1998, [2002] 2006) analyzed different score equating methods, some of them based on IRT. He found that some IRT-based models were useful for building the HSK item pool. The item pool was calibrated with the help of the one-parameter Rasch model, as implemented in the simultaneous parameter estimation of the BILOG software (Xiè Xiǎoqìng, 1998, p. 88). Originally, the HSK was built completely on the basis of CTT, which has several limitations. Thus, Huáng Chūnxiá ([2004] 2006) called for refining the quality of the HSK by adopting methods related to IRT (p. 304).

3.2 Quality standards of language tests

Quality standards or test qualities (zhìliàng biāozhǔn质量标准) are central for developing and using tests, especially scientific, psychological tests (Bühner, 2006; Lienert and Raatz, 1994; Moosbrugger and Kelava, 2007, p. 8; Rost, 2004). Thus, since the HSK is a standardized, large-scale, high-stakes language proficiency test, its quality standards or quality criteria have to be investigated. The main quality criteria are objectivity, reliability, and validity, also referred to as essential measurement qualities (Bachman, 1990, p. 24) because they “provide the major justification for using test scores” (Bachman and Palmer, 1996, p. 19; Grotjahn, 2000, 2003). Further criteria are fairness, standardization, authenticity, economy, transparency, practicability, and washback (Bachman and Palmer, 1996; Grotjahn, 2000, p. 308, 2003, p. 19; Lienert and Raatz, 1994, pp. 7–14; Moosbrugger and Kelava, ← 48 | 49 → 2007, p. 8).92 Bachman and Palmer (1996) introduced the “usefulness” of a test as an overall criterion, a concept developed to solve the key problem that arises from treating each quality criterion separately: How should all criteria be combined to ensure a good assessment? Bachman and Palmer (1996) claim that in traditional approaches, test qualities have been handled “more or less” independently, with one main goal being to maximize all of them, and “language testers have been told that the qualities of reliability and validity are essentially in conflict” (pp. 17–18; cf. Heaton, 1988; Underhill, 1982).93 In fact, there is some “tension” among some test qualities (Hughes, 1989). On the other hand, test qualities can also support each other (e.g., reliability supports predictive validity; cf. Lienert and Raatz, 1994, pp. 13–14). Therefore, these criteria are merely elements or features which have to be arranged into or tailored for a specific validation procedure or the development of a specific test.
The test itself, in turn, has to be designed for meeting a specific purpose. Bachman and Palmer (1996) describe this situation in the following way:

… test developers need to find an appropriate balance among these qualities, and that will vary from one testing situation to another. This is because what constitutes an appropriate balance can be determined only by considering the different qualities in combination as they affect the overall usefulness of a particular test. (ibid., p. 18)

Quality criteria have to be seen in the specific light in which the test is constructed.94 With regard to the validation of a test, this means that besides the issues which occur when defining a construct, this is another reason why an argument-based approach is useful. It is also the reason why the Standards underscore the use of an argument as a framework (AERA, APA, and NCME 1999). Therefore, in this section the basic concepts of these language test qualities will be explained and ← 49 | 50 → then related to concrete examples in proficiency testing for CFL. A second reason for using an argument-based approach is that the criteria themselves cannot always be clearly separated from each other. Nevertheless, an understanding of the quality criteria is fundamental for the validation chapter in this dissertation.

3.2.1 Objectivity

For Lienert and Raatz (1994) objectivity is the “degree of how independent the scores of a test are from the investigator” (p. 7); this means that if different investigators derive the same scores from the performance of the same candidates, the test is absolutely objective (Ingenkamp, [1985] 1997). Cronbach ([1949] 1970) says:

If a test is objective, every observer or judge seeing a performance arrives at precisely the same report. (ibid., p. 28)

Thus, objectivity is closely related to the standardization of the administration, the scoring, and the interpretation of an assessment. In the German testing literature, objectivity is often subdivided into Durchführungsobjektivität (administration objectivity), Auswertungsobjektivität (evaluation objectivity), and Interpretationsobjektivität (interpretation objectivity) (Bühner, 2006, pp. 34–35; Grotjahn, 2000, pp. 309–310, 2003, p. 19; Lienert and Raatz, 1994, pp. 7–8; Moosbrugger and Kelava, 2007, pp. 8–10). Administration objectivity applies to how independent test scores are from the test administration. The degree of administration objectivity depends mainly on how standardized the test administration is, including the behavior of the test administrators. The administration of the HSK could be considered highly objective. Administrators normally adhered strictly to the time limits provided for each of the four sections95 of the HSK; for instance, during the examination candidates were not allowed to return to earlier parts after the scheduled time for a specific section had expired, or to begin a new part ahead of schedule. Usually, the seating arrangement of the candidates had been determined by the administrators in advance (Meyer, 2009, pp. 25–26).96 Sūn Déjīn (2007) says HSK executives ← 50 | 51 → tried to minimize as much as possible other factors that could theoretically influence HSK test takers’ performance. For example, testing site arrangement, test administrators’ behavior, and strict adherence to test execution procedures were carefully controlled (p. 135). All explanations were read out loud by the chief test administrator (zhǔkǎo rényuán主考人员), and these instructions were kept to a minimum. All other instructions were played from a recording (Xiè Xiǎoqìng, [1995c] 1997, p. 61). Evaluation objectivity applies to the rules for scoring the observed test performance of the test candidates.
This criterion is strongly related to the format of the test items. If a test consists merely of gap filling (a cloze test) of Chinese characters in a Chinese text, and if the solution for every gap is always only one specific character, then this cloze test is absolutely objective with respect to scoring. Other examples are multiple-choice items where only one answer is correct. Therefore, the Elementary-Intermediate HSK could be regarded as highly objective because 154 of its 170 items were in multiple-choice format, with just one answer being correct. In the remaining 16 items test takers had to fill out a cloze test (tián kòng填空), where in each gap one character had to be written. Liú Yīnglín ([1990c] 1994, p. 3) marked these items as “semi-objective” (bàn kèguān半客观). Actually, they could be considered almost 100% objective because just one specific character was counted as correct. However, Liú Yīnglín ([1990c] 1994) might have classified them this way because determining what counts as a correct character might, in specific cases, differ from one scorer to another, according to how strictly scoring rules were applied to Chinese characters (with respect to character strokes). Interpretation objectivity refers to the interpretation of the obtained scores and how independent this interpretation is from the person who interprets them. If the test has been standardized with a norm-reference group, and the score report clarifies how test takers are distributed within this reference group, stakeholders can interpret obtained scores in relation to the norm group, and the interpretation is objective in this respect.
The old HSK was standardized according to a norm-referenced group, and it was possible to make inferences such as “the test taker has obtained x points and he or she belongs to the best y percent of the test takers in the HSK norm-reference group.” Nevertheless, the complicated level system of the old HSK led to confusion among test takers as well as test administrators (Meyer, 2009, p. 26).97 ← 51 | 52 →

3.2.2 Reliability

Reliability states how precisely a test measures something. If we repeat a measurement, and if neither the testing conditions nor the characteristic or trait of the tested person one wants to measure changes, and if the result we receive is exactly the same, then our measurement tool possesses absolute reliability (Moosbrugger and Kelava, 2007, p. 11). However, “two sets of measurement of the same features of the same individuals will never exactly duplicate each other” (Stanley, 1971, p. 356) because all assessments are “to a certain extent unreliable” (Crocker and Algina, 1986, p. 105). Thus, reliability is defined in the following way:

When a feature or an attribute of anything, whether in the physical, the biological, or the social sciences, is measured, that measurement contains a certain amount of chance error. … The amount of chance error may be large or small, but it is universally present to some extent. … In some cases, the discrepancies between two sets of measurements may be expressed in miles and, in other cases, in millionths of a millimeter; but if the unit of measurement is fine enough in relation to the accuracy of the measurements, discrepancies always will appear. … [this] is meant by unreliability. At the same time, however, repeated measurements of a series of objects or individuals will ordinarily show some consistency. … This tendency toward consistency from one set of measurements to another is called reliability. (Stanley, 1971, p. 356; emphasis in original)

A language test thus might be highly reliable, but at the same time it might primarily measure a candidate’s general knowledge, not his or her language ability (Grotjahn, 2003, p. 20). Reliability is also related to the number of items: in other words, the more items that measure the ability, the more accurate or reliable the assessment tends to be. This is one reason why professional, large-scale language tests consist of a relatively large number of items (Grotjahn, 2003, p. 20). When investigating reliability, we have to consider the origin of errors of measurement. Factors which might lead to imprecision in the measurement process are various and could stem from differing testing conditions, fatigue, anxiety, lack of motivation, or test wiseness98 of the test candidates, or from widely differing ratings of productive language performance observed in a test (Bachman, 1990, pp. 24, 160). Therefore, Bachman (1990) summarizes:

In any testing situation, there are likely to be several different sources of measurement error, so that the primary concerns in examining reliability of test scores are first, to identify the different sources of error, and then to use the appropriate empirical procedures for estimating the effect of these sources of error on test scores. (ibid., p. 24) ← 52 | 53 →

In language assessment, with regard to sources of measurement error we can distinguish between (1) unsystematic or unpredictable errors (random factors99), (2) attributes of the test takers which are believed not to be part of the construct we want to measure (cf. footnote 98), and (3) test method facets (cf. Bachman, 1990, p. 164). Minimizing the effects we are able to control will maximize reliability. High reliability is a necessary condition for valid test scores (Bachman, 1990, p. 160), but high reliability itself does not indicate whether the interpretation of the test scores is valid. For investigating reliability, four different procedures have been developed within CTT (Moosbrugger and Kelava, 2007, pp. 12–13): (1) retest reliability (zàicè xìndù再测信度), (2) parallel test reliability (fùběn xìndù复本信度), (3) split-half reliability (fēnbàn xìndù分半信度), and (4) internal consistency (nèizài yízhìxìng xìndù内在一致性信度).100 Retest reliability uses the same test at two different points in time. This rests on the theoretical assumptions that the trait or characteristic of the person one intends to measure does not change between the two administrations, that test takers do not memorize items, and that practicing the test does not affect performance. The reliability estimate is the correlation between both measurements, indicated by the reliability index, which ranges between zero and one, one indicating that the test measures without any measurement error. The amount of time that passes between the two measurements influences the reliability, most notably through memory effects or through changes in the trait of the test taker due to training effects. Parallel test reliability can be assessed by using two different parallel versions of one test, designed in such a way that they measure the same construct. In this way, the above-mentioned memory effects or changes of the trait can be eliminated or controlled.
In these respects, this procedure is believed to be better than the retest method, and it is also considered the ideal way of estimating reliability (Grotjahn, 2003, p. 20; Moosbrugger and Kelava, 2007, p. 12). Split-half reliability is applied when it is not possible to conduct a test again (e.g., when the test takers are not available for a retake) or to develop a parallel form. In this case, the test is cut into two halves101 (which should resemble each other), and the correlation between these halves is computed. In other words, it is a mathematical method that “produces” two parallel tests (Bachman, 1990, p. 172). Afterwards, the reliability estimate needs to be stepped up with the Spearman-Brown prophecy formula because it refers to a test of only half the original length. Internal consistency calculates the correlation among the items of the test. Every item is regarded as a test in itself. First, the correlations among all the items are computed; then, a mean correlation for the whole test is estimated. When using this method, the items being compared with each other should measure the same construct (Grotjahn, 2003, p. 21). For measuring internal consistency, Cronbach (1951) developed ← 53 | 54 → the so-called “Cronbach’s alpha coefficient,” which ranges between 0 and 1 (if the assumptions are satisfied, cf. Bachman, 1990, p. 178), with 0 meaning that the test measures absolutely unreliably and 1 meaning that it measures absolutely reliably (Bachman, 1990, p. 177; Grotjahn, 2003, p. 21). Grotjahn (2003) says that if the test is used to differentiate between individuals, the reliability should be 0.9 or even higher. For comparing groups, a reliability of approximately 0.6 is often sufficient (p. 21).
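The two formula-based estimates mentioned above, the Spearman-Brown step-up and Cronbach’s alpha, can be sketched as follows. This is a toy illustration with invented dichotomous (0/1) item scores, using population variances throughout; it is not drawn from any actual HSK data:

```python
# Reliability estimates from CTT, illustrated with invented dichotomous
# item scores (rows = test takers, columns = items).

def variance(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def spearman_brown(r_half):
    """Step a split-half correlation up to full-test reliability."""
    return 2 * r_half / (1 + r_half)

scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 2))  # 0.8
print(round(spearman_brown(0.8), 2))     # 0.89: the full test is more reliable than its halves
```

The Spearman-Brown result illustrates the point made earlier that reliability grows with the number of items: doubling the half-test back to full length raises the estimate from 0.8 to about 0.89.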

3.2.3 Validity (overview)102

In psychometric testing, the term validity103 is defined as the degree to which a test measures what it is intended to measure104 (Bachman and Palmer, 1996, p. 21; Garrett, 1937, p. 324; Grotjahn, 2000, p. 312, 2003, p. 21; Lienert and Raatz, [1961] 1994, p. 10; Rost, [1996] 2004, p. 34), or

[a] test is valid, when it measures the characteristic (trait) it should measure, and when it does not measure something else. (Moosbrugger and Kelava, 2007, p. 13)

As shown in chapter 2, clearly defining this “something” in language testing is extremely difficult, if not impossible. Moreover, considering the above-mentioned definitions, the purpose of a test is very important when evaluating whether score inferences are valid (Grotjahn, 2003, p. 21),105 or to what extent the score interpretation fits the purpose of the test. The HSK assesses the use of Pǔtōnghuà, the standard Chinese language used in the People’s Republic of China. The TOCFL, on the other hand, aims to measure the standard Chinese language used in Taiwan today, Guóyǔ. If someone takes the TOCFL, interpreting his or her score with regard to how well he or she might be able to use the Chinese language in Mainland China will be of somewhat limited value.106 Thus, a test ← 54 | 55 → taker who, for example, receives a high score on the HSK does not necessarily score high on the TOCFL, and vice versa. Indeed, it can be hypothesized that if the HSK and the TOCFL really replicate “authentic,” or rather typical, language use in Mainland China and in Taiwan, and if there is a considerable difference in this use, test takers should normally107 show a notable difference in test performance. This will occur not merely because the HSK uses simplified characters and the TOCFL traditional108 ones; it will also happen because the words and structures used in many situations in the two target language domains differ. On that account, Bachman (1990) underscores the use of the test:

It is also misleading to speak simply of the validity of test scores, since the interpretation and use we make of test performance may not be equally valid for all abilities and in all contexts. … To refer to a test score as valid, without reference to the specific ability or abilities the test is designed to measure and the uses for which the test is intended, is therefore more than a terminological inaccuracy. At the very least, it reflects a fundamental misunderstanding of validity; at worst, it may represent an unsubstantiated claim about the interpretation and use of test scores. (ibid., p. 238; italics in original)

Campbell and Fiske (1959) describe the relation between reliability and validity in the following way:

Reliability is the agreement between two efforts to measure the same trait through maximally similar methods. Validity is represented in the agreement between two attempts to measure the same trait through maximally different methods. (Campbell and Fiske, 1959, p. 83)

Bachman (1990) translated this statement into Figure 4:

Figure 4: Comparison of the measurement focus of reliability and validity.
From Bachman, 1990, p. 240 (slightly adapted).

With this figure, he visualizes a gradual shift within a continuum from reliability to validity, and he asks whether, for instance, the correlation between ← 55 | 56 → concurrent scores on two cloze tests based on different text passages should be interpreted as reliability or as validity (Bachman, 1990, p. 240). If the text passages are considered two different methods, the correlation will be interpreted as validity; if they are regarded as reflecting the same method, the correlation has to be seen in the light of reliability. Objectivity, reliability, and validity are related to each other in a hierarchical way, which is illustrated in Figure 5:

Figure 5: Relation between objectivity, reliability, and validity.
From Lienert and Raatz, 1994, p. 13 (slightly adapted).

According to Lienert and Raatz (1994, pp. 13f.), the main points regarding the relation between objectivity, reliability, and validity can be summarized as follows: First, objectivity and reliability are necessary, but not sufficient, conditions for validity. Objectivity influences reliability, and reliability builds a frame for validity. We thus have an inferential chain of conditions: if a test is not objective, it cannot be reliable, and if a test is not reliable, it cannot lead to valid score interpretations (in the sense of concurrent validity). Conversely, high objectivity and reliability are merely necessary conditions for validity (Rost, 2004, p. 33); even if a test possesses high objectivity and reliability, score interpretations need not be valid (cf. Grotjahn, 2000, p. 315). Second, parallel-test reliability and retest reliability cannot be higher than estimates of internal consistency or than administration and scoring objectivity (though these types of objectivity are difficult to quantify). Third, with regard to a criterion, a test can never be more valid than it is reliable. And fourth, if a test possesses high criterion validity, it also has a high degree of objectivity and reliability. ← 56 | 57 →
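The third and fourth points can be stated formally. In classical test theory, the observed validity coefficient of a test with respect to a criterion is bounded by the reliabilities of both measures (a standard textbook result; the notation below is mine, not taken from Figure 5):

```latex
% r_tc: observed test-criterion correlation (criterion validity)
% r_tt: reliability of the test, r_cc: reliability of the criterion
\[
  r_{tc} \;\le\; \sqrt{r_{tt}\, r_{cc}} \;\le\; \sqrt{r_{tt}}
\]
```

For example, a test with reliability $r_{tt} = 0.64$ cannot correlate higher than $\sqrt{0.64} = 0.8$ with even a perfectly reliable criterion.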

As denoted by the adjunct “with regard to a criterion” in Figure 5, validity can be seen from different angles, also called facets of validity (Messick, 1989b). The psychological testing literature distinguishes between criterion validity (section 3.3.2), content validity (section 3.3.3), and construct validity (section 3.3.4). A fourth, often-mentioned aspect is face validity (Bachman, 1990, 1996; Grotjahn, 2000, 2003), which refers to the degree to which a specific testing procedure appears valid in the eyes of the test takers and the test administrators (Lienert and Raatz, 1994, p. 137; Moosbrugger and Kelava, 2007, p. 15; Rost, 2004, pp. 45–46). Face validity is of special interest because if test takers “do not try their best” or “do not find the test useful” (Bachman, 1990, pp. 288–289), this might harm the practical use of the test (Grotjahn, 2003).109

3.2.4 Fairness

Fairness requires that items do not systematically turn out easier (or more difficult) for certain groups of test takers than for others due to factors that have nothing to do with the construct. A performance difference between groups (Differential Item Functioning, or DIF) does not in itself necessarily indicate unfairness. In some cases, however, the mere administration of a test might lead to unfairness; one example is when some test takers experience worse sound quality in a listening comprehension subtest because they are seated in a part of the test room where conditions (e.g., noise or too great a distance from the sound source) negatively influence their performance. Bachman (1990) underscores that

[i]t is important to note that differences in group performance in themselves do not necessarily indicate the presence of bias, since differences may reflect genuine differences between the groups on the ability in question. (ibid., p. 271)

Many people blamed the old HSK for putting too much emphasis on Chinese characters (e.g., Jìng Chéng, 2004), which would favor test takers with Japanese or Korean mother tongues. Whether the HSK was biased in favor of Japanese or Korean test takers is a question of construct validity, which means that one first has to define what the test or the item is intended to measure. Jìng Chéng (2004) correctly claims that the HSK listening subtest favored Japanese and Korean students because the multiple-choice answers were only given in Chinese characters. The question is whether the construct “listening ability” should merely consist of how well someone understands spoken language. If so, then Japanese and Korean test takers might have an advantage110 because these students scan the answer choices more quickly; ← 57 | 58 → this allows them to focus better on the listening material. However, if the answers have to be displayed in written form, e.g., for technical reasons, the question arises which graphical representation of Chinese would be the most neutral for all examinees.111 On the other hand, when assessing the reading ability of non-natives of Chinese, Japanese and Korean test takers will usually have a huge advantage simply because knowledge of Chinese characters plays an important role in their educational and cultural background. Thus, these students generally read faster and comprehend Chinese characters better. However, this does not count as bias or an unfair advantage, because the ability to read and understand is exactly what a reading comprehension subtest in Chinese should measure, regardless of where and when this ability was acquired. In this case, the advantage lies inside the construct one wants to measure, and it cannot be regarded as bias.
Furthermore, this does not change the fact that Japanese or Korean learners of Chinese can normally learn to read Chinese texts while investing significantly less time to reach a specific level of proficiency than Western learners. Conversely, if the origin of group differences in performance has nothing to do with the construct, Bachman (1990) states:

[W]hen systematic differences in test performance occur that appear to be associated with characteristics not logically related to the ability in question, we must fully investigate the possibility that the test is biased. (ibid., p. 272)
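Such investigations are commonly operationalized as DIF analyses. One standard screening statistic, not discussed in the passage above, is the Mantel-Haenszel common odds ratio: test takers are stratified by total score, and for each item a value near 1.0 suggests that the item functions similarly for the reference and the focal group. A minimal sketch with invented counts (not HSK data):

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across score strata.

    Each stratum is a tuple (a, b, c, d):
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    A value near 1.0 suggests the item behaves similarly in both groups;
    values far from 1.0 flag the item for a DIF review.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Invented counts for one listening item in two ability strata.
strata = [(40, 10, 35, 15), (30, 5, 28, 7)]
print(round(mantel_haenszel_odds_ratio(strata), 2))  # -> 1.64
```

Here the ratio above 1.0 would prompt a closer look at whether the item advantages the reference group for construct-irrelevant reasons.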

3.2.5 Norming112

For this criterion, a group has to be defined that is regarded as representative of all test takers for the interpretation of scores, a so-called norm-reference group. A sample of this norm-reference group then takes the test, which yields norm results. Afterwards, the results of individual test takers can be compared with the results of the norm-reference group, which means that “the quality of each performance is judged … according to the standard of the group” (Davies et al., 1999, p. 130). Normally, the scores of the norm-reference group will approximate a normal distribution. The old HSK was a norm-referenced test (chángmó cānzhào kǎoshì 常模参照考试) that related the score of a single test candidate to a norm-reference group. This relation was displayed on the HSK score report (chéngjìdān 成绩单) together with the mean and the standard deviation of the examination, and HSK test takers were able to see to ← 58 | 59 → which percentage of the best-performing testees of the norm group they belonged (cf. section 4.2).
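The percentile comparison described above amounts to locating a raw score within the norm group's distribution. Assuming normally distributed norm scores, this can be sketched as follows; the norm-group mean and standard deviation used here are invented for illustration, not actual HSK values:

```python
import math

def percentile_rank(score, norm_mean, norm_sd):
    """Percentile rank of a raw score relative to a normally distributed
    norm-reference group (returns a value between 0 and 100)."""
    z = (score - norm_mean) / norm_sd                   # standardize
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))    # standard normal CDF
    return 100.0 * phi

# Hypothetical norm group: mean 280, standard deviation 40 (invented values).
# A score one standard deviation above the mean:
print(round(percentile_rank(320, 280, 40), 1))  # -> 84.1
```

A test taker scoring one standard deviation above the norm-group mean would thus be told that roughly 16% of the norm group performed better, which is exactly the kind of comparison the old HSK score report displayed.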

3.2.6 Authenticity

Authenticity is another important quality criterion. However, estimates of authenticity are subjective, and Spolsky (1985) already alluded to its limits by saying that “[a]ny language test is by its very nature inauthentic” (ibid., pp. 31, 39).113 Authenticity usually refers to the extent to which the characteristics of a test task resemble a task in a real-life situation.114 This point has to be taken into consideration when developing and using language tests, because if there is a close relation between the test task and similar tasks in real-life situations, we can better generalize from the test performance to performance on tasks in the corresponding target language domain. Authenticity is also closely related to face validity: if test takers consider test items very inauthentic, this could influence their performance (Bachman and Palmer, 1996, p. 24; Douglas, 1997, p. 116; Grotjahn, 2003; McNamara, 1996). In terms of construct validity, the crucial question is whether test takers perform at their best. Bachman and Palmer (1996, p. 24) believe that if test candidates perceive items or a whole test as relevant, this should motivate them to perform as well as they can. However, Grotjahn (2000, 2003) underscores that authenticity should not be overestimated. He agrees that authenticity might lead to a higher degree of acceptance among test takers, but in high-stakes language testing settings like the German TestDaF, whose results are used by universities to decide whether a candidate will be admitted to a program, test takers will presumably do their best even when confronted with items they view as less authentic (2000, p. 319). Spolsky (1985, p. 35) indicates that many people rank discrete-point tests as inauthentic, but at the same time he mentions the C-Test115 (Klein-Braley, 1981; Raatz and Klein-Braley, 1982), which ← 59 | 60 → at first glance seems very inauthentic.116 However, the tasks of the C-Test can also be regarded as normal language behavior.117 The Elementary-Intermediate HSK contained a character gap-filling section as well.

3.3 Validity theory and validation

Which inferences and interpretations can be derived from HSK scores when considering what the HSK aims to measure? This question refers to the concept of validity. Validation is the process of gathering evidence to show that the inferences drawn from test scores are rational and appropriate, and that the decisions made on the basis of the test are justified. Using or developing a test requires evidence (validation) that the test adequately measures what it is intended to measure, and hence that the interpretations of test scores justify the use of the test (validity). There is no single validation method, and for a practitioner it is difficult to find concrete advice on how to validate an examination in practice, although several validation studies for language tests that employ specific validation methodologies have recently been conducted, for instance the study by Chapelle et al. (2008) on the Test of English as a Foreign Language. In the following sections, I will discuss the notion of validity in detail (sections 3.3.1 to 3.3.5) and outline existing concepts and methods of validation. I will then argue which validity theory I pursue in this work and which validation approach I use for the present study (sections 3.4.1 to 3.4.3).

3.3.1 What is validity?

Validity is believed to be the most essential quality criterion of psychological tests in general (Dài Zhōnghéng, 1987; Moosbrugger and Kelava, 2007, p. 13), and it is also pivotal to language tests (Bachman and Palmer, 1996, p. 19; Guō Shùjūn, [1995] 2006) because it concerns the meaning and interpretation placed on and derived from test results. At the same time, validity is the quality criterion of a test that is the most complex and most difficult to determine (Hartig, Frey, and Jude, 2007, p. 136). Concepts of validity and the historical development of validity theory up to the end of the 20th century are fairly well documented in the CFL assessment literature by Chinese scholars (Cháng Xiǎoyǔ, [2005] 2006; Chén Hóng, [1997a] 2006, [1997b] 2006; Xiè Xiǎoqìng, [2001] 2006; Zhāng Kǎi, [2005b] 2006), and this knowledge of validity concepts, which have been widely adopted in ← 60 | 61 → psychometric testing, partly formed an important basis for harsh criticism of the implementation of the HSK construct (Chén Hóng, [1997b] 2006; in more detail in sections 4.4 and 4.5).

The term validity denotes the degree to which a test adequately measures what it is intended to measure, where this “something” refers to a trait, often a theoretical construct or a network of intertwined traits. The higher the overall validity of a test, the better it measures the construct. Validity is a “matter of degree, not all or none” (Messick, 1989b, p. 13); a statement such as “the HSK is absolutely valid” (Gěng Èrlǐng, 1994, p. 382) is therefore simply wrong. Moreover, in psychological testing validity must always be related to the specific context in which the test is used (Grotjahn, 2000, p. 312). In the field of CFL testing, we want to measure a learner’s Chinese language ability. One major challenge of language testing is that our measuring instrument and the trait it measures both consist of language. The other significant issue relates to our construct: What is “Chinese language ability”? Is there a single best way to define Chinese language proficiency? As shown in chapter 2, no single best way to define the construct has been devised so far. Thus, an alternative framework is needed for carrying out the validation in this work.

Historically, there have been two important developments in validity theory (Messick, 1989b, pp. 18–20). One was a shift from numerous separate criterion-related validities to a small number of validity types, which finally led to a unified validity concept. The second was a shift from prediction to explanation, in other words to a “sound empirically grounded interpretation of the scores on which the prediction is based” (Messick, 1989b, p. 18). In brief, there are both classical models, which distinguish several types of validity, and modern models, which present validity as a single construct. The first concepts of validity evolved at the beginning of the 20th century (Hartig et al., 2007, p. 137). In the following three sections (3.3.2 to 3.3.4), I will describe how the different historical concepts of validity emerged, present their meanings, and give concrete examples of validity in CFL testing. Such a historical approach helps in understanding the central concept of construct validity because

[The concept of construct validity] has undergone several transformations since its introduction about fifty years ago. As a result of these shifts in interpretation, construct validity has accumulated several layers of meaning that are easily blurred. (Kane, 2006, p. 18)

3.3.2 Criterion validity

Within the last 90 years, validity has been defined in a variety of ways. Between 1920 and 1950, the focus was on the prediction of specific criteria, and during that period criterion validity118 was seen as the “gold standard” of validity (Angoff, 1988; Cronbach, 1971; Kane, 2006; Moss, 1992; Shepard, 1993). Guilford claimed that ← 61 | 62 → “in a very general sense, a test is valid for anything with which it correlates” (Guilford, 1946, p. 429). Criterion validity was defined in the first edition of Educational Measurement as “the correlation between the actual test score and the ‘true’ criterion score” (Cureton, 1951, p. 623); accordingly, it is usually expressed as a correlation coefficient. Originally, validation merely referred to how well a test estimated the criterion. Interestingly, at that time a test was considered valid for any criterion for which it provided accurate estimates (Gullikson, 1950). Within criterion validity, we can distinguish between predictive validity and concurrent validity. Predictive validity concerns inferences about the future performance of a person, a criterion that is not available at the time of testing; a well-known example in Germany is the Test for medical study courses (Test für medizinische Studiengänge, TMS), which tries to predict a candidate’s success in a university subject related to medicine.119 Concurrent validity uses a criterion measured shortly after or before the test, for example grades given by a teacher. Historically, the first validation of a test was done by predicting a criterion (Lissitz and Samuelson, 2007). The criterion concept worked, and continues to work, quite well when a plausible criterion is readily available, for example when the test is to predict future performance (e.g., success in studying a certain subject, in flight training, or in employment testing; Guion, 1998; Kane, 2006, p. 18).

According to Kane (2006), the criterion model has two major advantages. First, a typical interpretation in a criterion model claims that applicants with higher scores on the test can be expected to exhibit better performance in some activity (e.g., on the job), which can easily be checked. Second, criterion-related validity appears to be, and to some extent actually is, objective (p. 18). A serious limitation of criterion validity is the difficulty of finding an adequate criterion. When determining criterion-related validity in language testing, we look at how well an independent external criterion, for example the result of another test, conforms to our test scores (Grotjahn, 2000, p. 313). The outcome of this comparison is usually characterized by a correlation coefficient (Pearson’s product-moment correlation coefficient), whose value lies between −1 and +1. If candidates take an achievement test for a Chinese class at university level, for example, it might be difficult to find a criterion better than the test itself. In CFL proficiency testing, a possible criterion for the HSK score of a testee could be obtained by letting him or her take the TOCFL shortly after the HSK. The correlation coefficient indicates the extent to which both tests share common variance (a third variable could have an influence; thus, conclusions about the construct are always ← 62 | 63 → difficult). A correlation coefficient of 0.8, for example, would mean that both tests seem to assess the same construct to an extent of 64% (the coefficient has to be squared). However, in doing so we assume that the HSK and the TOCFL measure the same construct and that both tests fulfill comparable quality standards.120 The critical point is how well the criterion resembles our construct.121 On a theoretical level, there is also a logical problem: how can the criterion itself be validated?
Ebel (1961) notes that “even when a second criterion can be identified as a basis for validating the initial criterion, this simply pushes the problem back one step”; we are thus facing a problem of circularity (Kane, 2006, p. 19). In evaluating criterion validity it is important to remember that objectivity influences reliability, and reliability in turn influences validity. A test with low objectivity and low reliability cannot be “valid,” but high objectivity and high reliability do not guarantee that score interpretations are valid; they are a necessary, but not a sufficient, condition.
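The shared-variance arithmetic mentioned above (r = 0.8 implying 64% common variance) can be illustrated with a short computation; the HSK/TOCFL score pairs below are invented purely for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented HSK/TOCFL score pairs for five test takers (illustration only).
hsk   = [210, 250, 280, 300, 340]
tocfl = [45, 52, 60, 58, 70]

r = pearson_r(hsk, tocfl)
print(f"r = {r:.2f}, shared variance = {r * r:.0%}")  # -> r = 0.97, shared variance = 95%
```

The squared coefficient (the coefficient of determination) is what licenses statements such as “64% common variance”; the remaining variance may reflect different constructs, measurement error, or third variables.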

3.3.3 Content validity

The concept of content validity is quite simple: every test covers specific content, and the term refers to how well the content of a test represents the content it is supposed to measure (the content domain). In other words: “The content model interprets test scores based on a sample of performances in some area of activity as an estimate of overall level of skill in that activity” (Kane, 2006, p. 19). A content domain could be a curriculum, a description of some area of expertise (a sample of some type of performance122), or a detailed test specification of the test format (Alderson, Clapham, and Wall, 1995; Davidson and Lynch, 2002). According to Guion (1977), there are three conditions under which we may extrapolate from the performance of a person on our test to performance in the domain we originally sampled from: (a) the observed performances have to be considered a representative sample from the domain; (b) the performances are evaluated appropriately and fairly; and (c) the sample is large enough to control sampling error.123 ← 63 | 64 →

Content validity is usually based on expert judgments (Grotjahn, 2003, p. 22; Kecker, 2010, p. 133), but experts might differ considerably in their estimates of what should comprise the content of a test (Grotjahn, 2000, p. 312, 2003, p. 22). Content validity is especially problematic when it is used to support validity claims about cognitive processes or theoretical constructs (Cronbach, 1971, p. 452). Another limitation is that it “tends to be subjective and to have confirmatory bias” (Kane, 2006, p. 19). As a result, judgments about the relevance and representativeness of test tasks tend to confirm the proposed interpretation, especially when test developers make them. Messick (1989b) indicated that content-based validity evidence itself does not actually include test scores, although according to Messick validity is a judgment “based on test scores or other modes of assessment” (Messick, 1989b, p. 13).124 Content-based evidence does not provide direct evidence for the “inferences to be made from test scores” (p. 17), as test responses and test scores are not included in the content model (the same holds for construct validity). A listening comprehension test might cover the content of a curriculum fairly adequately, but other problems could arise. For instance, if listening passages in a Chinese achievement test are too long, the short-term memory of test takers could quickly be overstrained, and they would not have enough time to read the questions before the next listening item starts because their skills in reading Chinese characters are not sufficient; thus, their answers could be biased. These examples show that test content always has to be implemented in a specific test situation and is intertwined with a theoretical construct.

In proficiency testing for CFL, content validity is an extremely important point, especially when it comes to reading Chinese characters. That is why the Hanban published a syllabus in 1988 covering the vocabulary, characters, and grammatical structures primarily used in the HSK, the Graded syllabus and graded standard of proficiency in Chinese (Hànyǔ Shuǐpíng Děngjí Biāozhǔn hé Děngjí Dàgāng 汉语水平等级标准和等级大纲; Dàgāng 大纲 for short). In principle, it includes the main content tested on the HSK.125 This syllabus was edited by CFL language testing experts at the HSK Center of the Beijing Language and Culture University (cf. section 1.4). After a revision in 2001, the Hanban launched the new HSK in 2009, which has a completely new syllabus. The most distinct feature of the 2009 syllabus is an enormous decrease in vocabulary, characters, and grammatical structures for essentially the same levels of competency as the old HSK. This is a very obvious example of how massively politics can influence testing, and it clearly reveals that language testing almost always serves an ulterior motive. Moreover, it raises the question of how much and what kind of vocabulary and grammatical structures ← 64 | 65 → should be integrated into a CFL proficiency syllabus. Several studies have been conducted that included or focused on a validation of the HSK syllabus (Dàgāng) (e.g., Da Jun, 2007; Niè Hóngyīng, 2007; Zhāng Kǎi and Sūn Déjīn, 2006).

Although content-related evidence plays an important role in validation, other kinds of evidence are required to go “beyond the basic interpretation” (Kane, 2006, p. 19). Even in the most recent validation concepts, however, content-related evidence still provides the basis upon which everything else is built (Sireci, 2009, p. 33). To put it in Grotjahn’s (2000) words, content validity is a “necessary condition, but not a sufficient one for the validity of a test” (p. 312). Conversely, if we develop a test based on content that is inappropriate for the testing goal, inferences drawn from test takers’ scores can be considered invalid.

3.3.4 Construct validity

The notion of construct validity126 emerged in the 1950s (Cronbach and Meehl, 1955) and has its origin in psychological testing. Originally, it was an alternative to the criterion and content models, to be used when a real criterion was not available (Shepard, 1993, p. 416). Even at that time, Cronbach and Meehl were aware that construct validity was not just a substitute for or a supplement to criterion and content validity. In fact, they suggested that construct validity was a fundamental concern even when a test was validated using criterion or content evidence; they simply did not present the “construct” as a general framework for validity (Kane, 2006, p. 20). This last step in validity theory was left to Messick (1989b; see section 3.3.5). In psychological testing during the 1950s, one of the major questions concerned how to measure abstract traits (theoretical attributes) such as ego strength (Cronbach, 1971), and this question finally led to the concept of construct validity. The core issue is that for abstract traits there is no distinct content to sample from, nor is there a uniquely pertinent criterion to predict, and Cronbach suggested that any description “that refers to the person’s internal processes (anxiety, insight) invariably requires construct validation” (1971, pp. 451 and 462). According to Cronbach, for such traits there is “a theory that sketches out the presumed nature of the trait” (1971, pp. 462–463). In Cronbach and Meehl’s (1955) model, construct validity followed the hypothetico-deductive model of scientific theories, in which a theory consists of a network of relationships linking theoretical constructs to each other and to observable attributes (Kane, 2006, p. 20). Thus, the evidence for construct validity requires the definition of the specific construct ← 65 | 66 → to be framed by a theory, often a so-called nomological network (Cronbach and Meehl, 1955; Kecker, 2010, p. 134).
The idea of the network is to tie theoretical attributes to observable attributes; via this operationalization (the theory or nomological network), the observable attributes lead to the abstract traits.

Cronbach and Meehl (1955) claimed that these three different forms of validity had to be included in the validation of a test to prevent test developers from choosing merely one of them and declaring their test, or the interpretations of its scores, valid (Kane, 2006, 2008). In 1971, Cronbach emphasized the need for an overall evaluation of validity in testing:

Validation of an instrument calls for an integration of many types of evidence. The varieties of investigation are not alternatives any one of which would be adequate. The investigations supplement one another … For purposes of exposition, it is necessary to subdivide what in the end must be a comprehensive, integrated evaluation of the test. (Cronbach, 1971, p. 445; italics in original)

3.3.5 Messick’s unitary concept

By the late 1970s, two opposing trends existed in validity theory. One tried to identify specific kinds of validity, which were meant to help validate particular interpretations and uses of test scores; the other sought a unified validity concept.127 Validity theorists such as Cronbach, Guion, and Messick belonged to the second group, tending towards a more unified approach (Cronbach, 1980b; Guion, 1977, 1980; Messick, 1975, 1981; Tenopyr, 1977), because they were concerned about the growing tendency to “treat validation methodology as a toolkit, with different models to be employed for different assessments” (Kane, 2006, p. 21). The idea of subsuming all validity evidence under construct validity had already emerged in the 1950s, when Loevinger stated that “since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view” (Loevinger, 1957, p. 636), but it took until the early 1980s before construct validity as a general approach was widely accepted (Anastasi, 1986; Embretson, 1983; Guion, 1977; Messick, 1980, 1988, 1989b). In the Chinese HSK literature, for example, Liú Yīnglín ([1990] 1994) still speaks of different “types” of validity and does not explain how to weigh them against each other.128 Several years later, Zhāng Kǎi ([1994] 2006, p. 202), who adopted the notion from Bachman (1990), became the ← 66 | 67 → first Chinese CFL specialist to explicitly mention construct validity as a unitary concept (zhěngtǐ gàiniàn 整体概念).129

Messick developed a unifying framework for validity, in which he “relegated the content model to a subsidiary role …, and he treated the criterion model as an ancillary methodology for validating secondary measures of construct” (Kane, 2006, p. 21). Kane summarizes the achievement of Messick’s approach as follows:

The adoption of the construct model as the unified framework for validity had three major positive effects. First, the construct model tended to focus attention on a broad array of issues inherent in the interpretations and uses of test scores, and not simply on the correlation of test scores with specific criteria in particular settings and populations. Second, it emphasized the pervasive role of assumptions in score interpretations and the need to check these assumptions. Finally, the construct model allowed for the possibility of alternative interpretations and uses of test scores. (Kane, 2006, p. 21)

In his very influential article “Validity”130 in Educational Measurement (Linn, 1989), Messick defined validity in his opening sentence in the following way:

Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (Messick, 1989b, p. 13; italics in original)

Many validity experts still base their view of validity on this definition, which closely resembles the definition of validity in the Standards (cf. AERA, APA, and NCME, 1999, p. 9). Not surprisingly, this sentence has been placed right at the beginning of the validity chapter of Zhāng Kǎi’s (2006) edited volume on language testing theory and Chinese testing research, where it is cited by Cháng Xiǎoyǔ ([2005] 2006, p. 167).131 Messick stated explicitly that all kinds of evidence, or facets of validity, have to be integrated into construct validity because it is not sufficient to draw on merely one or two kinds of validity evidence. Additionally, Messick indicated that it is not the test itself that can be considered valid or invalid, but rather the rational logic by which inferences (and later actions and decisions) are derived from test scores. In the second edition of Educational Measurement (Thorndike, 1971), Cronbach had already pointed out that the validity of scores has to be demonstrated for every single, specific test use, which means that every test has to be seen in the light of its particular purpose. If ← 67 | 68 → the purpose of the test changes, the score interpretation must change as well. Cronbach raised this point to help prevent test misuse.132

In addition, Messick turned validity theorists’ attention to the consequences of test use. In his model, he explicitly included the value implications implicit in testing as well as its social consequences (1989b, p. 20). Value implications are fundamental for score interpretations; thus, they are also vital for the justification of score inferences. “Another way to put this is to note that data are not information; information is that which results from the interpretation of data” (Mitroff and Sagasti, 1973, p. 123). And Kaplan (1964) stated:

What serves as evidence is the result of a process of interpretation—facts do not speak for themselves; nevertheless facts must be given a hearing, or the scientific point of the process of interpretation is lost. (ibid., p. 375)

Some measurement specialists have criticized Messick on the grounds that adding value implications and social consequences to the validity framework would “unduly burden the concept” (Messick, 1995, p. 748). These critics therefore argued for a more limited definition of validity (Borsboom, Mellenbergh, and van Heerden, 2004; Mehrens, 1997; Popham, 1997; Sackett, 1998). In fact, however, Messick did not add values; he merely stated and identified them explicitly because they are always intrinsic to the meaning and outcomes of testing, which is why value aspects must be an integral part of the validation process (Messick, 1989a).

Value implications and social consequences are fundamental for CFL proficiency testing as well, as can currently (2013) be seen in the new HSK. One important reason for lowering the competency levels of the new HSK is purely political: to help promote CFL.133 Policymakers hope to make it easier for beginner students of Chinese to participate successfully in the official state language proficiency exam, thereby giving them a sense of achievement (Erfolgserlebnis) despite the long, and sometimes frustrating, process of learning Chinese.

Table 1: Facets of validity—the progressive matrix.134

Figure 6: Schematic circle diagram of Messick’s unified validity concept.
Drawn by the author of this dissertation.

In Messick’s unified validity framework there are two interconnected facets: the left column shows the source of justification of testing, based on the appraisal of evidence and consequences; the upper row refers to the function or outcome of testing, namely interpretation or use. In the cells labeled I to IV of Messick’s original table, only the words typed in small capitals appear.135 However, he emphasizes that all the cells overlap, which means that they are not distinct (displayed in Table 1 with dashed lines). Furthermore, he states that for a comprehensive validation approach, one should add to every cell the facets given in brackets. Shepard (1997) warns that test developers might not recognize the overlapping character of the cells (p. 6): because construct validity does not appear in cells III and IV of Messick’s 1989 table (it is included in a later version; Messick, 1995), (a) test developers might focus merely on finding evidence for construct validation, and (b) they could neglect investigating social consequences or the values implicit in testing, validity facets often shaped and influenced by politicians. In Table 1, the evaluation of test use becomes a two-step process “from score to construct and from construct to use” (Kane, 2006, p. 21). Figure 6 illustrates the overlapping character of the validity facets of Messick’s concept in a circle diagram. It shows that construct validity is of paramount importance relative to the other facets of validity. In this concept, the other facets, namely relevance/utility, value implications and social consequences, are all embraced within construct validity because each of them relates to the construct in its own way.
Therefore, relevance/utility, value implications and social consequences are not independent and must not be interpreted in an isolated manner; rather, they have to be embedded into, or related to, the construct. In addition, the interaction between social consequences and value implications, as well as between social consequences and relevance/utility, is clearly indicated: these aspects interplay under the mantle of the construct.

Messick alludes to several influential sources in the literature on psychometric testing, namely the Standards (AERA, APA, and NCME 1985), the five editions of Anastasi’s Psychological Testing (1954, 1961, 1968, 1976, 1982) and Cronbach’s four editions of Essentials of Psychological Testing (1949, 1960, 1970, 1984), all of which show a clear tendency towards a unified framework. Finally, he argues for a unified approach:

Hence, the testing field, as reflected in the influential textbooks by Anastasi and by Cronbach as well as in the professional standards, is moving toward recognition of validity as a unitary concept, in the sense that score meaning as embodied in construct validity underlies all score-based inferences. But for a fully unified view of validity, it must also be recognized that the appropriateness, meaningfulness, and usefulness of score-based inferences depend as well on the social consequences of testing. Therefore, social values and social consequences cannot be ignored in considerations of validity. (Messick, 1989b, p. 19)

Messick highlights two threats that detract from the construct. The first threat arises when the test is too broad and captures factors extraneous to the construct one intends to measure. This results in construct-irrelevant variance. Here, we can further distinguish between construct-irrelevant easiness and construct-irrelevant difficulty: construct-irrelevant easiness appears when biased items or task formats permit some individuals to answer correctly in ways that are irrelevant to the construct being assessed. Construct-irrelevant difficulty is the opposite case: extraneous aspects of the item make it more difficult for some individuals or groups to answer correctly. The second (and opposite) threat is that the test is too narrow and does not include essential elements of the construct it should measure, namely construct underrepresentation. An example of construct-irrelevant variance in CFL proficiency testing scores is the listening subtest of the old HSK. The answers on the answer sheet are printed in Chinese characters, not in Hànyǔ Pīnyīn, so this part also measures test takers’ ability to read Chinese characters, not only their listening skill.136 A massive case of construct underrepresentation is that the old HSK did not assess productive oral skills137, at least under the premise that the construct of the old HSK was intended to cover productive oral skills. Thus, the definition of the construct is the pivot in language testing, and in testing in general, and validity theory must offer a model that clearly depicts and underscores this distinctive feature.
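The contaminating effect of the character-based answer sheet on the listening score can be illustrated with a deliberately simple toy model (the numbers and the min-rule are my own hypothetical illustration, not an empirical claim about the HSK): an item is answered correctly only if the examinee both understands the audio and can read the printed options.

```python
def listening_subtest_score(listening_ability: float, char_reading_ability: float) -> float:
    """Toy model: expected score on a listening item whose answer
    options are printed in Chinese characters (hypothetical min-rule)."""
    # The examinee must BOTH understand the audio AND decode the printed
    # options, so the weaker of the two skills caps the expected score.
    return min(listening_ability, char_reading_ability)

# Two hypothetical examinees with identical listening ability
# but different character-reading ability:
strong_reader = listening_subtest_score(0.9, 0.9)   # 0.9
weak_reader   = listening_subtest_score(0.9, 0.4)   # 0.4
```

Equal listening skill yields unequal “listening” scores: the difference is construct-irrelevant variance introduced by the answer-sheet format, exactly the contamination described above.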

Messick continually stressed validation as a scientific process138, and every test should repeatedly undergo validation research. This process helps to refine the construct and construct validity in two ways. First, sources of construct-irrelevant variance can be identified and eliminated, or at least reduced. Second, the construct can be strengthened through confirmatory validation studies that buttress it. Messick (1989) alluded to Cronbach’s notion of “validation as persuasive argument.” Cronbach stresses that “the argument must link concepts, evidence, social and personal consequences, and values” (Cronbach, 1988, p. 4). Indeed, this is the core notion of today’s argument-based approaches to validation (amongst others Bachman, 2005; Bachman and Palmer, 2010; Kane, 1992, 2001, 2002, 2006; Mislevy et al., 2003). This dissertation follows an adaptation of Kane’s argument-based approach, which will be explained in more detail in section 3.4.

Messick’s article “Validity” has been recognized as a fundamental contribution to the field of psychological testing and language testing (Bachman, 1990, pp. 236–295; Cháng Xiǎoyǔ, [2005] 2006; Chén Hóng, [1997a] 2006, [1997b] 2006; Fulcher and Davidson, 2007; Grotjahn, 2000, 2003; Kane, 2006; Kecker, 2010; McNamara, 2006b; Zhāng Kǎi, [2005b] 2006), and his “unitarian” view of validity has strongly influenced validation research (Kunnan, 1998; Kecker, 2010). However, the concept has one major disadvantage: it is highly theoretical, and Messick does not provide practical advice on how to validate. Thus, some testing experts consider his concept too demanding and challenging for practitioners (Bachman, 2005; Kane, 2001, 2008; McNamara, 2006a), who are generally concerned with how to implement their own validation research concretely. Others have argued that the focus on the construct is simply not convenient for validation studies; instead they prefer to rely mainly on content validity and reliability (Borsboom et al., 2004; Lissitz and Samuelson, 2007).

Another step toward implementing validation in practice was Messick’s article “Validity of Psychological Assessment” (1995). Here he explains that speaking of validity as a unified concept “does not imply that validity cannot be differentiated into distinct aspects” (ibid., p. 744). Validity, namely construct validity, can be divided into six aspects addressing “central issues implicit in the notion of validity” (ibid., p. 744) because, according to Messick, the various inferences made from test scores would probably require different types of evidence.139 These six aspects, implicit in the notion of validity as a unified concept, are the content, substantive, structural, generalizability, external, and consequential aspects (Messick, 1989b, 1995).140 Messick thus abandoned the notion of different kinds of validity; instead he “invented” facets of a unified construct, namely facets of construct validity. But one last problem still had to be solved. In the process of validation, which resembles assembling a mosaic (Messick, 1995), one has to decide which source of evidence to use and when. How can the right sources of evidence (facets of validity) needed in a specific validation procedure be pieced together? The solution is to embed the whole validation process into a framework that logically connects the different facets of validity, as Messick stated: “What is required is a compelling argument that the available evidence justifies the test interpretation and use” (Messick, 1995, p. 744; emphasis added). The right combination of useful sources of evidence can be characterized as “prioritiz[ing] the forms of validity evidence needed according to the points in the argument requiring justification or support” (ibid., p. 747; Kane, 1992; Shepard, 1993) because validation is an evaluation argument (Cronbach, 1988, quoted in Messick, 1995). Messick describes this aspect in the following statement:

[V]alidation is empirical evaluation of the meaning and consequences of measurement. The term empirical evaluation is meant to convey that the validation process is scientific as well as rhetorical and requires both evidence and argument. … Evidence pertinent to all of these [six] aspects needs to be integrated into an overall validity judgment to sustain score inferences and their action implications, or else provide compelling reasons why there is not a link, which is what is meant by validity as a unified concept. (Messick, 1995, p. 747; italics in original)

3.4 Validation of tests

Validation in CFL testing has not been clearly addressed until recently. Zhū Hóngyī (2009) explains how different facets of validity can be examined, but fails to illustrate how these facets can be integrated and implemented in practice. Other language testing experts from China, however, such as Lǐ Qīnghuá (2006), have clearly addressed different types of validation methods that have recently become widely adopted. Lǐ referred to the argument-based approach by citing, amongst others, Kane (1992, 2001, and 2002) and Bachman (2004), and by pointing to Weir’s (2005) evidence-based approach.141 All authors concur that validation is a “long process of collecting” evidence (e.g., Hé Fāng, [1994] 2006, p. 178). Zhāng Kǎi (2006c) describes this process by highlighting the role of the construct:

Proving and verifying whether the construct is correct means proving that the test indeed measures the ability to which the construct refers. And this is a question of construct validity. (ibid., p. 7)


The key to modern validation approaches lies in connecting test performance with its interpretations, which makes validation a form of scientific inquiry. Thus, Cronbach (1971) says about validation:

To explain a test score, one must bring to bear some sort of theory about the causes of the test performance and about its implications. Validation of test interpretations is similar, therefore, to the evaluation of any scientific theory. (Cronbach, 1971, p. 443)

3.4.1 Kane’s argument-based approach to validity

What kind of validity evidence is needed, and when? And how can different kinds of evidence be combined in a specific validation situation? These were the main questions Kane aspired to solve. He therefore aimed to provide “clear guidance on how to validate specific interpretations and uses of measurements,” or, as he states, “a pragmatic approach to validation” (Kane, 2006, p. 18). For Kane, validating an interpretation or use of measurement means “to evaluate the rationale, or argument” (Kane, 2006, p. 17). This basic idea is congruent with the Standards (Sireci, 2009, p. 28): score interpretations are by their very nature based, at least in part, on an argument. When we interpret, we use an argument. For Kane, validation is an evaluation aimed at “the extent to which the proposed interpretations and uses are plausible and appropriate” (p. 17). Validation is the process of evaluating the plausibility of proposed interpretations and uses, and validity is the extent “to which the evidence supports or refutes the proposed interpretations and uses” (ibid.). The notion of an argument underlying the validation process was already mentioned by Cronbach (1988), Messick (1989b), and Kane (1990). Kane clearly depicts the interpretation of score-based inferences; stakeholders and scholars should be able to reconstruct, or retrace, how and why scores are interpreted in a specific way, and whether proposed interpretations are appropriate. Assumptions underlying interpretations and inferences should be explicitly formulated and stated (Kecker, 2010, p. 138). Thus, when evaluating the appropriateness of a proposed score interpretation we need a “clear and fairly complete statement of the claims included in the interpretation and the goals of any proposed test uses” (Kane, 2006, p. 22):

The proposed interpretations and uses can be specified in detail by laying out the network of inferences and assumptions leading from the test performances to the conclusions to be drawn and to any decisions based on these conclusions. (Kane, 2006, p. 22; cf. Crooks, Kane, and Cohen, 1996; Kane, 1992; Shepard, 1993)142

This model is responsive to differences in proposed interpretations and uses and to the context in which the scores are to be used. We can draw different inferences, depending on the specific context (Kane, 2006, p. 22). This concept can also be found in the Standards:

Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use. … Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use. The conceptual framework points to the kinds of evidence that might be collected to evaluate the proposed interpretation in the light of the purposes of testing (AERA, APA, & NCME, 1999, p. 9)

For example, if we draw inferences from a Chinese reading test created for interpreting reading ability in a Taiwanese context (e.g., the TOCFL), we may infer that a test taker has a certain level of Chinese reading ability, but only in a Taiwanese or Guóyǔ language setting. This is because the Taiwanese test uses traditional characters143 and, to some extent, includes (and should include) words and phrases characteristic of Guóyǔ and representative of usage in Taiwan today (even this context is still very broad), e.g., jìchéngchē 計程車 for cab, or jiǎotàchē 腳踏車 for bicycle.144 Thus, inferences and interpretations of TOCFL scores have to be reinterpreted for another setting, for instance if we want to draw inferences from a candidate’s TOCFL scores about his or her Chinese reading ability in Mainland China. Language test scores always have to be interpreted from a certain angle, depending on the context. The core idea of an argument-based approach is that it requires an explicit angle, and scores always have to be interpreted in relation to it. Without that explicit angle we can gather all kinds of validity evidence, but the reason why we collect it (the purpose), as well as the perspective from which we interpret this evidence, might remain unclear and diffuse.

Figure 7: Arrows representing inferences linking components in performance assessment.
From Kane, Crooks, and Cohen, 1999, and Kane, 2006.

In his approach, Kane draws on Stephen Toulmin’s model of argumentation (Toulmin, 1958, 2003). This model consists of a claim, for instance the interpretation of test scores. The claim is based on data, such as scores or other manifestations of a test candidate’s performance. The relation between the claim and the data has to be justified by a warrant, and the warrant itself has to be supported by backing (empirical data from an investigation). In this argumentative chain, counterproposals can be brought forward as rebuttals, which try to challenge or weaken the argument or the interpretation of test scores.145 Kane (1992) lists three general criteria for evaluating an argument: the clarity of the argument, the coherence of the argument, and the plausibility of its assumptions (p. 528).

Figure 8 is an example of an interpretative argument warranting the claim that a student’s English speaking ability is not sufficient for studying at an “English medium university.” The initial grounds for this claim are a student’s presentation, which is characterized by hesitations and mispronunciations. The warrant itself is backed by the teacher’s training and previous teaching experience. On the other hand, there is a rebuttal which weakens the claim: the topic the student presented required very technical and unfamiliar vocabulary. This argument structure is the core concept of the validation study of the Test of English as a Foreign Language (TOEFL iBT), a research project carried out by Chapelle et al. (2008).

Figure 8: Example structure of an interpretative argument about speaking ability.
From Mislevy, Steinberg, & Almond, 2003 (adapted by Chapelle et al., 2008, p. 7).
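The components of this interpretative argument can be represented schematically, for instance as a small data structure (a sketch of my own; the field names follow Toulmin’s terminology, and the example content paraphrases the Figure 8 example above):

```python
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    claim: str                  # the conclusion, e.g., an interpretation of scores
    data: str                   # the grounds the claim rests on
    warrant: str                # why the data support the claim
    backing: str                # support for the warrant itself
    rebuttals: list = field(default_factory=list)  # challenges weakening the claim

# Paraphrase of the speaking-ability example:
figure_8_argument = ToulminArgument(
    claim="The student's English speaking ability is insufficient for "
          "studying at an English-medium university",
    data="A presentation marked by hesitations and mispronunciations",
    warrant="Hesitant, error-laden speech indicates insufficient speaking ability",
    backing="The teacher's training and previous teaching experience",
    rebuttals=["The presentation topic required very technical, unfamiliar vocabulary"],
)
```

Laying out an argument in this explicit form makes each link, and each rebuttal that must be ruled out, individually inspectable, which is precisely the point of the argument-based approach.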

Based on Toulmin’s model of argumentation, Kane developed an argumentative chain for validating a test, in which each argument builds the bridge to the next one, starting from a sample of observations and finally leading to the target score (Kane, 1992, 2001, 2002, 2006). An example of a sample of observations would be a performance sample of a participant in a Chinese language test. The target score tells us (or should tell us) how to interpret the score in the context of the target domain. A target domain might be, for example, speaking Chinese in everyday life in an environment typical for non-native adult speakers of Chinese in the contemporary PRC. Test development should start from the target domain (Chapelle et al., 2008). The target domain is the basic ground for all other considerations, not only for test development but also for test evaluation or validation after a test has been put into operation, because it builds the framework for all other steps in the argumentative chain. A target domain might be defined very broadly (e.g., Carroll’s definition of intelligence) or quite narrowly (e.g., some skill on a specific task). The more accurately the target domain is defined, the better we can draw a representative sample from it. This is especially true for language proficiency tests146, whose trait labeling, namely “language proficiency,” often suggests that some kind of overall language proficiency across all contexts exists.

The next step in test development has to be the representative sampling of the target domain. For example, what kind of specific speaking situation(s) in which a foreigner usually uses Chinese could be representative in today’s PRC? It might be a cab ride from the airport to the town center, a visit to a museum or a restaurant, or bargaining at a Chinese market.147 Thus, what to include in or exclude from the target domain (for the sampling) is based on value judgments, some of which are implicit and some explicit (and not only on values, of course; empirical needs analyses are important, too). Weighting the different elements of the sample is a crucial point in developing a language test.148 Kane denotes the portion of the target domain sampled by the test as the universe of generalization (Kane, 2006). Finally, when we test, we want to estimate the target score. The target score should tell us as precisely as possible how well our candidate would perform in the target domain. We estimate the target score through a chain of inferential bridges: the performance sample of the test taker yields our observed score, which we generalize to the universe score (Brennan, 2001a, 2001b; Cronbach et al., 1972; Shavelson and Webb, 1991). In the end, the universe score is extrapolated to the target score.
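This inferential bridging can be sketched numerically. In the sketch below, all figures are invented, and the simple averaging and weighting rules are crude placeholders for the statistical machinery of generalizability theory and empirical extrapolation studies; the point is only to make the three bridges visible.

```python
# A candidate's performance sample: items correct on three sampled speaking
# tasks (cab ride, restaurant, market), 10 items each -- invented numbers.
tasks_correct = [8, 9, 7]

# Scoring inference: quantify the performance sample into an observed score.
observed_score = sum(tasks_correct) / 30.0          # 0.8

# Generalization inference: treat the mean over the sampled tasks as an
# estimate of the score over the whole universe of generalization.
universe_score = observed_score

# Extrapolation inference: the universe of generalization covers only part
# of the target domain; a hypothetical weight stands in for an empirically
# established extrapolation relation.
extrapolation_weight = 0.9
target_score_estimate = universe_score * extrapolation_weight   # ~0.72
```

Each assignment corresponds to one bridge in Kane’s chain, and each bridge is a claim that a validity argument would have to warrant.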

Kane’s approach includes two steps. The first step is to build an interpretive argument which involves an argumentative chain. This chain consists of the “inferences leading from observed performances to the claims based on these performances” (Kane, 2006, p. 23). In a second step, validation studies and research data are used to rebut or to warrant this argumentative chain; this is referred to as a validity argument.

For standardized tests, Kane emphasizes that “the universe of generalization is a restricted subset of the target domain” (2006, p. 31). But in what sense is it limited? He gives the example of the target domain for adult literacy, which already somewhat resembles the problems faced when one aims to measure reading skills in CFL proficiency testing:

While the target domain for adult literacy would include a very wide range of written material (e.g., novels, instructional manuals, magazines […]), responses (answering specific questions, giving an oral or written summary, taking some action based on manual or sign), and contexts (e.g., at home, in a library, at work, or on the road), the universe of generalization for a measure of literacy may be limited to responses to objective questions following short passages while sitting at a desk or computer terminal. In most contexts, the reader can start and stop at will; in the testing context, the reader is told when to begin and when to stop. The performances involved in answering questions based on short passages under rigid time limits are legitimate examples of literacy but they constitute a narrow slice of the target domain for literacy. (Kane, 2006, p. 31)

Therefore, especially in objective testing, it is not self-evident that we can extrapolate scores to the target score. A common skepticism among test takers toward standardized tests in general, and also an allegation made by test takers of the old HSK, is that it is not possible or logical to simply extrapolate from HSK scores to the target domain, namely “Chinese language proficiency” (whatever this may be). From a certain point of view, this skepticism is legitimate. Kane (2006) makes this misgiving explicit in the following statement:

As a result of standardization, the samples of tasks included in measurements are not random or representative samples from the target domain, and it is not legitimate to simply generalize149 from the observed score to the target score. It is certainly not obvious, a priori, that performance on a passage-based objective test of literacy can be extended to the target domain of literacy, even if the observed scores are consistent over replications of the measurement procedure […]. (ibid., p. 31)

And therefore, he concludes:

[T]he interpretation of observed performance in terms of the target score requires a chain of reasoning from test results to an observed score, from the observed score to the universe score, and from the universe score to the target score. (ibid.; italics added)

This is another very clear statement in which Kane requires that the implicit score interpretations underlying the interpretational chain, the chain of reasoning, be revealed. Implicit interpretations must be made explicit in an argumentative approach. If there is a logical gap in the reasoning, the score interpretation might be invalid.

The definition of the trait one intends to measure is a crucial point in testing. In language proficiency testing for CFL, we have to ask what has to be included in CFL proficiency. We know that the trait interacts with the target domain (Kane, 2006, p. 33), but what can we do if the target domain is not specified very well? In the case of the HSK, information about the target domain is partly contradictory (see sections 2.3 and 4.1). Beyond the problem of specification, even if we specify the target domain more thoroughly, we face the difficulty that some trait implications might go beyond the target domain because (a) other possible implications exist (Cronbach, 1971, p. 448), and (b) many trait labels were long in use before anyone decided to measure them (Bruner, 1990). The concept of “universal” language proficiency is a very good example of a trait concept that has been in use for thousands of years. Assumptions about traits are connected to our experience. If we add the time dimension, we can distinguish between traits that remain quite stable over time150, e.g., general mental ability, and traits that change over time, e.g., moods (Kane, 2006, p. 32). Language proficiency is a trait that can stay relatively stable over time; however, it can improve through effective instruction or deteriorate if a second-language learner has not used for some time a language he or she learned years ago. In chapter 5, I will investigate the assumption that the amount of training in Chinese language classes has a positive influence on Chinese language proficiency.

Trait labeling is closely connected to associations, which also influence test development. Test developers tend to give “generalized abstract names to variables” (Cook and Campbell, 1979, p. 38), and as a result, trait labels “may make implicit claims that the trait can be interpreted more broadly” (Kane, 2006, p. 32). In CFL proficiency testing, the trait “Chinese language proficiency” used for the HSK, the TOCFL, and many other tests is an excellent example of this phenomenon because in these tests the labeling invites the user to conjecture about what the label includes. Speculations, associations and images in a field as broad as CFL proficiency (and often in language proficiency testing in general) naturally tend to be very heterogeneous. They might range from the numerous contexts in which Chinese can be used to the question of the extent to which Pǔtōnghuà should be taken as the standard. The last point is of special interest in CFL because it involves the diversity of Pǔtōnghuà, and of languages in general, since some CFL linguists argue that Pǔtōnghuà is a theoretical construct that one almost never comes across in “authentic” real-life situations.151 Mainland Pǔtōnghuà and Taiwanese Guóyǔ, the standards of Chinese in the PRC and the Republic of China respectively, differ considerably in real language praxis. On the other hand, their theoretical definitions make it difficult to see how they actually differ because Pǔtōnghuà “[l]ike guóyǔ before […] is a standard language based on the dialect of Peking” (Norman, 1988, p. 137; Ramsey, 1989, p. 15; cf. Chen, 1999; DeFrancis, [1984] 1998; List, 2009). In addition, even within these standards it is sometimes unclear what belongs to the standard and what does not: on the one hand, the phonology of Pǔtōnghuà and Guóyǔ has been strictly codified (Peking pronunciation as the standard; cf. Norman, 1988, p. 137), as have the Chinese characters; on the other hand, no codified grammar exists even today, and word usage between the HSK and the Taiwanese TOCFL indeed differs tremendously: Zhāng Lìpíng (2007) has computed an overlap of only 4,797 words between the word syllabi of the old HSK and the TOCFL, an overlap accounting for only 55% and 60% of the HSK’s and the TOCFL’s word syllabi, respectively (cf. Meyer, 2012, p. 120; Zhāng Lìpíng, 2007; cf. footnotes 106, 125, and 144). The problems raised by the question of defining the standard of the Chinese language cannot be solved in this work; however, the variety within Chinese in terms of non-standardized pronunciation, as well as the context of language use of native speakers of Mandarin in terms of grammar, words, etc., unmistakably reveals that value judgments are an integral and inevitable part of trait labeling in CFL testing. Thus, trait labels and descriptions typically involve value judgments that influence the evaluation of the proposed interpretations and uses of the test (Kane, 2006, p. 32), and therefore these value implications have to be made explicit (Messick, 1989b).
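The kind of syllabus comparison Zhāng Lìpíng carried out can be sketched on tiny invented word lists (the real syllabi contain several thousand entries each; the five-word Pīnyīn samples below, echoing the cab/bicycle contrast, are my own illustration):

```python
def syllabus_overlap(words_a, words_b):
    """Shared words between two word syllabi, plus the share of
    each syllabus that the shared words cover (in percent)."""
    a, b = set(words_a), set(words_b)
    shared = a & b
    return len(shared), 100.0 * len(shared) / len(a), 100.0 * len(shared) / len(b)

# Hypothetical five-word excerpts from a Mainland and a Taiwanese syllabus:
hsk_sample   = ["qìchē", "xuésheng", "péngyou", "chūzūchē", "zìxíngchē"]
tocfl_sample = ["qìchē", "xuésheng", "péngyou", "jìchéngchē", "jiǎotàchē"]

n, pct_hsk, pct_tocfl = syllabus_overlap(hsk_sample, tocfl_sample)
# 3 shared words here, i.e., 60% of each five-word sample; on the real
# syllabi Zhāng Lìpíng (2007) found 4,797 shared words, roughly 55% and
# 60% of the HSK and TOCFL syllabi respectively.
```

The computation itself is trivial; the substantive decisions lie in compiling the word lists, which is exactly where the value judgments discussed above enter.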

Kane (2006) developed a figure which schematizes how to use the interpretive argument in testing. This schema is also very useful for CFL because it makes clear which steps and what kind of reasoning are inherent in testing and CFL assessment. Therefore, I have adopted Kane’s model and added examples addressing issues typical for CFL. The labeling of the trait, the definition of the target domain, and the universe of generalization are strongly connected and partly intertwined, especially in standardized testing. If we develop a test for CFL proficiency, we should initially depict the target domain as clearly as possible. In other words, we have to consider the target language situation (e.g., Bachman and Palmer, 1996, p. 95ff). For example, if we are designing a test that tells us whether a learner of CFL has achieved adequate language ability to study in a “typical” Chinese B.A. program at a Mainland Chinese university, the next step would be deciding what should be included in the target domain.

To this end, several questions must be raised. We should imagine typical situations for a foreigner studying in a B.A. program at a Chinese university (context). How would foreign students probably use the Chinese language? We could ask about typical texts students have to read in Chinese B.A. programs. We should determine which specific language skills are needed, and when. Then we can ask in more detail when a specific skill, such as listening comprehension, is needed. In which situations would foreign students be confronted with listening to Chinese (lectures, seminars, the dining hall, dialogs with other students, the librarian, etc.)? How often would students have to cope with varieties of Pǔtōnghuà (accents), and how strong could these variations presumably be? Would reading comprehension require students to read Chinese handwriting to a certain extent, or would it be sufficient if they could “merely” read printed or machine-typed texts?

When we try to answer the questions that help sharpen the target language domain, we can see that the boundaries of the target domain are rather fuzzy, and the situations foreign students might be confronted with seem infinite. Besides, there will probably not be just one trait but rather a mixture of variables underlying the notion of CFL ability for studying in Chinese B.A. programs. Our goal is to find a blueprint illustrating which situations and what kinds of abilities to sample, and what kinds of language tasks a student would have to master. Here it becomes clear that it is impossible to model the target domain in an “objective” way because we have to make almost innumerable value judgments leading to final decisions. In addition, we have to consider our resources in the test development process. The universe of generalization, which should resemble the target domain as closely as possible, depends strongly on our technical resources. Thus, testing is always a trade-off. Do we have the testing facilities to assess listening skills (which should be no problem in a high-stakes test)? If students in Chinese B.A. programs occasionally have to write an essay, our test should include an essay section. But perhaps we do not have enough adequately trained raters, so practical considerations will limit our universe of generalization even more.

On the other hand, one general rule underlies almost all considerations: representativeness. Our picture of the target domain should be as representative as possible, modeling the “average” foreign student who lives at an “average” Chinese university, studying in an “average” Chinese B.A. program. As in several validation studies supporting the new TOEFL iBT, it is possible to approximate this average situation to some degree: by observing the campus life of foreign students studying in Chinese B.A. programs (naturalistic observation), by asking students or teachers what language requirements the students have to fulfill in their daily lives (statistical survey), by investigating curricula and texts students have to read (content analysis), etc.

Figure 9: Measurement procedure and interpretive argument for trait interpretations.
From Kane, 2006, p. 33. Slightly adapted.

In Figure 9, the interpretive argument consists of four major inferences: scoring, generalization, extrapolation, and implication. The first inference we draw when using an interpretive argument in testing is to evaluate our observation, i.e., the performance of the test taker that is measured or quantified. We obtain a raw score or scaled score by applying a scoring rule. This inference is called scoring (Kane, 2006) or evaluation (Chapelle et al., 2008, 2010). The quantified performance is our observed score. The warrant for this inference should be backed by evidence that the scoring rule is appropriate. Appropriate scoring criteria are based on the judgments of the experts who develop and review the test, the care with which the scoring procedures are implemented, and the procedures for selecting and training scorers (Clauser, 2000). An example of empirical evidence backing the scoring procedure would be a check on scoring consistency and accuracy; the fit between the model and the equated scores can also be evaluated empirically. However, the warrant for scoring can be undermined by numerous factors: Kane lists, e.g., scoring rubrics that reflect inappropriate criteria, rubrics that fail to include some relevant criteria, and flawed selection or training of scorers and scoring control procedures (Kane, 2006, pp. 34–35). Scoring is one of the strengths of the old HSK because it was a highly standardized, high-stakes test consisting almost exclusively of multiple-choice items152, which means that the items were closed-ended. In addition, the old HSK was a norm-referenced test, although it also showed features of a criterion-referenced test (cf. chapter 4). Raters and rater training were less important for the old HSK because rating scales were used only in the Advanced HSK, whose essay subtest (xiězuò 写作) and SOPI subtest (kǒuyǔ 口语) assessed written and oral productive skills, respectively.
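To make the scoring inference concrete, the scoring rule of a closed-ended multiple-choice test can be sketched in a few lines of Python. The item keys, responses, and the linear reporting scale below are invented for illustration; operational tests such as the HSK derive scaled scores through equating, not through a fixed linear rule.

```python
# Hypothetical scoring rule for a closed-ended multiple-choice test:
# the raw score is the number of responses matching the answer key.

def raw_score(responses, key):
    """Count correct answers; unanswered items ('') score zero."""
    return sum(1 for r, k in zip(responses, key) if r == k)

def scale_score(raw, n_items, lo=100, hi=800):
    """Map a raw score linearly onto a reporting scale (a simplification:
    operational tests use equating, not a fixed linear transformation)."""
    return lo + (hi - lo) * raw / n_items

key = ["A", "C", "B", "D", "A"]          # invented answer key
responses = ["A", "C", "D", "D", ""]     # invented test-taker responses
r = raw_score(responses, key)            # 3 items answered correctly
print(r, scale_score(r, len(key)))
```

The point of the sketch is only that the scoring inference turns an observed performance into an observed score by a rule that must itself be justified.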
The second inference (generalization) inquires to what extent parallel forms of the test also measure the intended construct. To draw this inference, a representative sample must be gathered from the universe of generalization because the generalization inference leading to the universe score essentially takes the step backwards from the universe of generalization to the sample of observations. Thus, the universe score in Figure 9 runs parallel to the universe of generalization, and a dashed line links both. Even if the test and other existing parallel forms sample the universe equally well and representatively, estimates of the universe score still include a sampling error. Empirical evidence supporting the generalization inference comes from reliability studies (Feldt and Brennan, 1989; Haertel, 2006) or generalizability studies (Brennan, 2001b; Cronbach et al., 1972). Generalization depends on the warrant that test scores are comparable across test events, which means that the conditions of observation have to be consistent with the measurement procedure. If these conditions “involve impediments to performance” (e.g., faulty equipment) or ameliorate the performance (inappropriate aid), the generalization inference is weakened (Kane, 2006, p. 35). The extrapolation inference attempts to predict the performance of the test candidate in real-life situations using the target score (see the arrow in Figure 9). For this inference, the relationship between the universe of generalization and the target domain is crucial: performances on test tasks should not substantially differ from performances in the target domain. Here, the notion of face validity can play an important role. If test takers do not take the test seriously, this might weaken the extrapolation inference.153 Correlations with another criterion that measures the same construct can provide backing for the extrapolation inference (Kecker, 2010, p. 139).
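The reliability studies cited as backing for the generalization inference typically report coefficients such as Cronbach’s alpha. The following minimal sketch computes alpha from the standard formula, α = k/(k−1) · (1 − Σσ²ᵢ/σ²ₜ); the item-by-person score matrix is invented illustration data, not HSK results.

```python
# Cronbach's alpha: a common internal-consistency reliability estimate
# that can serve as empirical backing for the generalization inference.

def cronbach_alpha(items):
    """items: one list of scores per item, each of equal length
    (one entry per test taker). Returns
    alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[p] for item in items) for p in range(n)]
    return k / (k - 1) * (1 - sum(var(item) for item in items) / var(totals))

# Invented dichotomous scores: three items, four test takers.
items = [
    [1, 0, 1, 1],  # item 1
    [1, 0, 1, 0],  # item 2
    [1, 1, 1, 0],  # item 3
]
print(cronbach_alpha(items))  # 0.5625
```

A value this low would itself weaken the generalization warrant; in practice, high-stakes tests aim for much higher coefficients across considerably more items.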
The extrapolation inference can be extended by another aspect: explanation. The goal of explanation is to determine a theoretical model underlying the candidates’ performance (Kane, 2002, p. 31). The implication inference concerns the trait label and the proposed uses of the test scores. Both often contain implications that go beyond the definition of the target domain. Therefore, trait implications extend the interpretation of scores beyond a “simple inductive summary” (Kane, 2006, p. 37). One important question concerns how adequately the target domain suits the assumptions underlying or associated with the trait (Cook and Campbell, 1979). Empirical investigations of specific implications of the trait can check theoretical conceptions of it; Kane mentions, for example, the trait’s change over time. In the case of CFL, there is a common belief that the trait (ability in CFL, or CFL proficiency) is somehow related to the learner’s hours of classroom exposure. As he notes:

[I]f the trait is expected to vary as a result of some intervention, change in the expected direction would support the proposed interpretation. (Kane, 2006, p. 37)

So, the intervention in this example would be the time spent in class, and the expected direction would be that a learner who spends more time in class gains more proficiency (cf. chapter 5). The last inference in testing is the decision inference (Kane, 2006, p. 24). Chapelle et al. (2008) state that “[d]ecision-making links the target score to the decisions about test takers for which the score is used” (ibid., p. 12). This inference is also referred to as utilization (Bachman, 2005). The decision-making inference differs from the other inferences because it adds a new dimension to the interpretive argument, the dimension of score use.154 Decisions depend on value assumptions (Kane, 2006, pp. 24 and 51). Adjoining this inference means implementing Messick’s requirement to include the issue of consequences in the validation process. Decisions have to be evaluated in terms of their outcomes or consequences; in addition, a policy is needed to execute the decision inference (Kane, 2006, p. 51). This is a very important inference in test validation because it reveals the perception or weltanschauung (worldview) underlying the decision-making process, namely the policy, as Kane puts it:

Policies are not true or untrue, accurate or inaccurate. They are effective or ineffective, successful or unsuccessful. A policy that achieves its intended goals (positive consequences) at modest cost, and with few undesirable side effects (negative consequences) is considered a success. A policy that does not achieve its goals (lack of positive consequences), and/or that involves relatively high cost or produces significant undesirable side effects (negative consequences) is considered a failure. (Kane, 2006, p. 51)

The core question of whether a specific consequence tends to be more positive or negative is purely a value judgment, though one that can be supported by empirical claims. Furthermore, Kane (2006) distinguishes between semantic interpretations and decisions. A semantic interpretation draws conclusions based on assessment results and assigns meaning to them. Mostly, the semantic interpretation comes first, and the decision follows. Thus, semantic interpretation and decision are distinct and sequential155 (Kane, 2006, p. 51).

The key issue in validating the decision rule is how (or where on the score scale) to define the cutscore. To identify a reasonable cutscore, we can conduct a standard-setting study, for example. The purpose of such a study is to back the choice of the cutscore (Hambleton and Pitoniak, 2006), which has to be related to a concept of a (minimal) level of competence, the so-called performance standard. The cutscore is an operational definition of the decision rule; once a performance standard is set, empirical evidence can be used to evaluate how well the cutscore represents this standard (Hambleton and Pitoniak, 2006; Kane, 1994).
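How a cutscore operationalizes the decision rule, and how decisions based on it can be checked against an external performance standard, can be sketched as follows. The cutscore value, the scores, and the master/non-master judgments are all hypothetical.

```python
# A cutscore is an operational definition of the decision rule:
# compare each observed score to the cutscore and classify.

CUTSCORE = 180  # hypothetical pass mark from a standard-setting study

def decide(score, cutscore=CUTSCORE):
    """Apply the decision rule to one observed score."""
    return "pass" if score >= cutscore else "fail"

def decision_accuracy(scores, masters, cutscore=CUTSCORE):
    """Agreement between test-based decisions and independent
    master/non-master judgments of the same candidates; one rough way
    to evaluate how well the cutscore represents the standard."""
    hits = sum((s >= cutscore) == m for s, m in zip(scores, masters))
    return hits / len(scores)

print(decide(185))                                   # pass
print(decision_accuracy([200, 150, 185, 100],        # hypothetical scores
                        [True, False, True, True]))  # 0.75
```

The accuracy figure makes the trade-off visible: raising or lowering the cutscore changes which candidates are misclassified, which is exactly the value-laden choice a standard-setting study is meant to justify.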

Consequences have been a part of concepts of validity for a long time (Guion, 1974; Messick, 1975, 1989b, 1998; Shepard, 1997), and traditional notions of validity are also connected to consequences because these concepts often addressed how well a test finally achieves its goals (Cureton, 1951; Cronbach and Gleser, 1965; Linn, 1997; Moss, 1992). Thus, consequences are closely related to questions of test fairness and equity. These issues arose during the 1960s in the United States after the Civil Rights Movement started (Cole and Moss, 1989; Ebel, 1966), and from that time on, testing had to aim at ensuring fairness for all test takers across different groups, especially for racial minorities in employment testing (Kane, 2006, p. 54). Because test developers usually want to back or validate the claims they make, Kane proposes that test users should play the main part in analyzing the consequences of test use, since they are in the best position to evaluate the outcomes of testing:

Test users identify the kinds of decisions to be made and the procedures to be used to make these decisions (Cronbach, 1980b; Taleporos, 1998). They presumably know the intended outcomes, the procedures being employed, and the population being tested, and therefore, they are in the best position to identify the intended and unintended consequences that occur. (Kane, 2006, p. 55)


Publication date
2014 (March)
Frankfurt am Main, Berlin, Bern, Bruxelles, New York, Oxford, Wien, 2014. 349 pp., 41 b/w fig., 129 tables

Biographical notes

Florian Meyer (Author)

Florian Meyer studied Sinology, Communication Science and Korean at Free University Berlin, and Chinese at Peking University. He worked as a lecturer for Modern Chinese at Ruhr University Bochum (Germany), where he studied Language Teaching Research and completed his PhD.

