Social perspectives on language testing

Papers in honour of Tim McNamara

by Carsten Roever (Volume editor) and Gillian Wigglesworth (Volume editor)
254 Pages
Series: Language Testing and Evaluation, Volume 41


Bernard Spolsky

The slow recognition of the social impact of language testing

The shibboleth test was perhaps the first recorded test with a strong social impact. As commonly cited by language testers with a sense of the Bible, it was an oral test administered during the war between the Gil’adites under the judgeship of Yiftah and the Efratites during the time of the settlement of the land:

“So Yiftah passed over to the children of ‘Ammon to fight against them: and the Lord delivered them into his hands. And he smote them from ‘Aro’er, as far as Minnit, twenty cities, with a very great slaughter… And Gil’ad seized the passages of the Yarden before Efrayim: and it was so, that when the fugitives of Efrayim would say, let me cross; that the men of Gil’ad would say to him, Art thou an Efratite? If he said, No; then they said to him, say now Shibbolet: and he said Sibbolet: for he could not frame to pronounce it right. Then they took him, and slew him at the fords of the Yarden; and there fell at that time of Efrayim forty two thousand.” (Judges, 1977, pp. 11–12).1

Robert Lado quoted this story at the beginning of his doctoral dissertation (Lado, 1949, p. 6). Alan Davies (1992) also cited it, and judged the test to be “quite useless, too powerful”, discounting the bureaucratic preference for binary “pass-fail” testing. McNamara (2005) used the word Shibboleth in the title of his attack on the use of language tests to identify the origin of asylum seekers, a process he joined others in condemning because of doubts about both its accuracy and its ethics (Eades, Fraser, Siegel, McNamara, & Baker, 2003).

While for a long time language testers ignored the ethical or social dimension of their products, many earlier tests had obvious social or political roles. The Imperial Chinese Examination had a political goal, to select mandarins who could be trusted by the Emperor because they were not dependent on nepotism or favor from lords. The first European recognition of the power of testing was perhaps by Matteo Ricci (1942), a Jesuit priest, who brought the idea of testing to Europe from his time in China. In France, the result was weekly examinations introduced into Christian schools to manage the progress of teaching (de la Salle, 1720, 1838; Madaus, 1990). In this educational purpose, the testing matched the traditional annual examinations at Oxford and Cambridge Universities, given at first orally and, when students’ knowledge of Greek and Latin weakened, in writing (Clarke, 1959, pp. 98–99).

But of course education is also social and political. Foucault, perhaps biased by his own failures in the agrégation examination (Eribon, 1991, pp. 36–38), was deeply concerned about the development of “l’âge de l’école ‘examinatoire’ ”, which he saw as the beginning of education as a science and source of power (Foucault, 1975, p. 189). In his book, which deals with torture, punishment and discipline, he says that examinations provide “a normalizing gaze, a surveillance that makes it possible to qualify, to classify and to punish” (1975, pp. 186–187). Just as the daily examinations of patients in a hospital gave doctors power over the institution, and made the hospital “a place of training and of the correlation of knowledge” (1975, p. 188), so the regular examinations of the eighteenth-century French Christian schools meant that knowledge flowed back from pupils to teachers, establishing the science or discipline of education.2 This transformed pupils from children who needed help to objects under control and data for science. School examinations, then, can be seen as having an important social and political dimension.3

The ethical purpose was emphasized in nineteenth-century Britain when Macaulay (1853) proposed the application of what was called the “Chinese principle”4 to the selection of cadets for the Indian Civil Service. The Indian Civil Service Examination introduced into nineteenth-century Britain was an elite selection process. Macaulay first raised the notion in the 1833 debate on the East India Company Bill, suggesting it as a replacement for the patronage system whereby members of the directorate (who were also government ministers) nominated and appointed cadets to the service. Cadets entered a career for potential nabobs, who could hope to return from the Orient with a fortune made through the East India Company. His proposal was rejected by Parliament, but twenty years later, he again raised it in the debate on the India Bill, noting in his speech the number of men of distinction (governors-general of India, lawyers, judges, ministers) who had first made their mark in the competitive examinations of Oxford and Cambridge. While some doubts were expressed, the Act passed, but it was only in 1858 that the first twenty-one cadets were selected, out of sixty-seven candidates, on the basis of an examination which included English language, literature, and history; language, literature and history of Greece and Rome and of France, Germany and Italy; Sanskrit and Arabic language and literature; and mathematics and natural and moral sciences (Roach, 1971). By 1865, it was reported that most of the successful candidates had been educated at the best universities (Oxford, Cambridge, London, Edinburgh and Trinity College Dublin), although one Brahmin of high caste had passed because of his knowledge of Sanskrit and Arabic (Roach, 1971).

With the new public demand for testing, both Oxford and Cambridge were persuaded to appoint Syndics to develop university entrance examinations; at the request of parents in Exeter, these examinations were conducted locally rather than just at the universities. A few years later, examinations were also being used to recruit for the Prussian and French senior civil service, and in Britain in 1870 for senior positions in the Home Civil Service. The popularity of examinations for selection was satirized by Gilbert and Sullivan in their opera Iolanthe (1882), in a suggestion that Dukes would soon be selected “by competitive examination”. The Oxford and Cambridge entrance examinations, spread beyond the university towns by being administered locally, became an important method of recognizing the academic achievements of elite schools, but the existence of a number of Local Examination Boards avoided the uniformity that developed in France, where the Catholic schools had been taken over by the State and used for central control by Napoleon.

But there was some criticism of examinations, both ethical and technical. One critic was Henry Latham, master of Trinity Hall, Cambridge, who attacked examinations as an “encroaching power” that was forcing students to narrow their focus and make use of crammers and cramming schools. He was afraid that teaching in England, as in France, was becoming subordinate to examination. He also drew attention to technical problems: some examinations were marked by impression, and standards varied. But his main concern was the effect on teaching (Latham, 1877).

A major attack on the technology of examinations came a decade later, with the appearance of two papers by Professor Francis Ysidro Edgeworth. Edgeworth was born in Ireland in 1845 to a Catalan mother, and educated at Trinity College, Dublin and Balliol College, Oxford, where he studied ancient and modern languages. He was called to the Bar, but did not practice, and set out to learn economics and mathematics. He later held chairs in economics at King’s College, London and the University of Oxford, becoming founder and editor of The Economic Journal in 1891. He published important pioneering work applying mathematics to economics, and was elected President of the Royal Statistical Society in 1912. Among his many publications, two papers raised serious challenges to the objectivity and fairness of public examinations. In the first (Edgeworth, 1888), he applied probability theory to examinations, arguing that marks should be established by averaging the judgements of several different competent critics; this “cumulation of erroneous observations” would “approximate to the truth” (1888, p. 602). He concluded that an examination was “a sort of lottery” in which the chances were better for the deserving. In a second paper, Edgeworth (1890) analyzed a number of marked essays, also comparing the marks given by the same examiner on two occasions; he found the probability of error was between 6% and 10%, and showed that a candidate needed to score 24% above the honors level to be confident of the result. He concluded that there was “unavoidable uncertainty” in examinations, so that it was wiser to report classes of pass (as at Oxford) than to attempt to establish ranking (as at Cambridge). He listed the many sources of uncertainty, including the health of the candidate or marker and the suitability of the questions. Essentially, the challenge he set was to find the best way to use uncertain results. Instead, a new and growing discipline of psychometrics set itself the task of improving the accuracy of what soon developed into a highly profitable testing industry.

A strong defense of testing was being developed by a number of European and American scholars, believers in the need for and the possibility of measuring human mental abilities. One was Francis Galton, a cousin of Charles Darwin and one of the founders of the fields of statistics and eugenics. He believed that mental as well as physical characteristics could be reduced to numbers (Galton, 1883). It was an American, James Cattell, who applied this notion to education, developing a large battery of tests (Cattell, 1890) and trying to make them as accurate and fair as possible; he criticized the irresponsible unfairness of school grades awarded without such “scientific” care (Cattell, 1905). A French doctor, Alfred Binet, developed a series of tasks of varying difficulty that could be used to determine the age at which a “normal” child could perform them, and so establish the mental age (the intelligence) of individuals (Binet & Simon, 1916). Clustered into a single score by a German psychologist (Stern, 1914) to give the Intelligence Quotient, Binet’s work was further boosted by developments in statistics such as factor analysis that permitted the measurement of a general factor (g). Though disputed, the IQ was popularized by the American psychologists Henry Goddard, who persuaded the American Association for the Feeble-minded to adopt the Binet tests, and Lewis Terman, who developed the revised Stanford-Binet test. The test, Terman (1916) argued, would permit identifying the retarded and so reduce the crimes for which they were responsible. These founders of objective testing and psychometrics had, it is clear, socially respectable goals, even if their work showing inequality often led to ethically doubtful results such as eugenics and resulted in the kind of objectivization of human beings that Foucault spoke against.

Public acceptance of the new tests came after the First World War. In England, Cyril Burt (1921), psychologist for the London County Council, introduced the general use of the Binet-Simon scale.5 In the United States, the rapid growth of school and psychological testing was bolstered by the public relations claims of Robert Yerkes, who asserted that his Army Alpha tests had contributed to the American war effort (Yerkes, 1921; Yoakum & Yerkes, 1920); this in spite of the fact that they were largely ignored by the Army. Publication led to claims that recent immigrants from Southern and Eastern Europe were less intelligent than those from Northern Europe, although this was an effect of the high correlation between test results and years in the US. His arguments encouraged American anti-immigrant xenophobia; as Expert Eugenics Agent to the Congressional Committee on Immigration, he played an important role in developing the formula applied in the 1924 Immigration Act.6

Criticism of examinations continued. In the United States, the journalist Walter Lippmann was an outspoken opponent: he worried about how norms were set, what was really being measured, how group tendencies were applied to individuals, how children were labelled as a result, and how the IQ claimed to be measuring something inherited. But even he admitted that tests helped children fit into school (Block & Dworkin, 1976). Lewis Terman responded, disclaiming responsibility for the misuse of tests, a response similar to that made by makers of cigarettes and guns. In Britain, one of the most serious critics of examinations that involved essays was Sir Philip Hartog, a chemist and educator, who published several studies showing the lack of reliability of the marking (Hartog, Ballard, Gurrey, Hamley, & Smith, 1941; Hartog & Rhodes, 1935, 1936).

But these criticisms were already anticipated by the many psychologists who developed the field they had named “psychometrics”, and were essentially met by the growing reliance on what was called “objective testing”, the use of batteries of questions with what were claimed to be single correct answers. These “new-type” tests, either true-false or multiple choice, were introduced into language testing by Daniel Starch (1913, 1916), who printed foreign language tests in Latin, French and German: each had a list of foreign words to be matched with an English translation and a group of sentences to be translated into English. The American development was matched in Britain by the work of Cyril Burt. Other early objective language tests were those in Spanish and English (Handschin, 1919) and in French (Henmon, 1921). Objective language tests became widespread, especially those developed by Vincent Henmon as the key instruments in the Modern Foreign Language Study in the 1920s (Coleman, 1929). Objective tests provided good data for the statistics used to establish reliability, and were faster and cheaper to administer than the marker-based assessments of compositions and unseen translations.

Examinations quickly became popular as a method of checking on the individual accomplishments of school pupils, something which was condemned as focusing and restricting teaching.7 The possibility of using tests to block access to potential immigrants was recognized by the Kenyan and Australian governments in using a dictation test to ban individual immigrants,8 and picked up by the US Commissioner of Immigration in his requests for an examination capable of keeping out people exploiting a loophole in the 1924 Immigration Act that allowed potential students entry, avoiding the limitations of the act (Spolsky, 1995a). This led to the development of three English tests (TOEFL was the third), which continue to have social impact in limiting access to US universities. There is now major competition from Cambridge and Pearson examination batteries; other tests are used by many European nations in order to control university admission, immigration, and citizenship.

Apart from some nineteenth century objections, test makers and psychometrists ignored social impact, trying rather to reduce the “inevitable uncertainty” of examinations. The breakthrough came when late in the twentieth century Samuel Messick included social values in his definition of test validity. Since then, language testing theorists, such as Bachman and McNamara, have given a stronger role to the social impact of tests, condemning especially the abuse of testing in immigration decisions, and its harmful effects on university admission, and language testers have argued for ethical use. But looking closely, apart from a growing number of universities in the USA which no longer use standardized industrialized tests to control admission, the world-wide testing industry, like the US gun industry, appears to be able to continue to operate profitably.

While tests have regularly had social goals and consequences, the social dimension was only recognized in psychometric research in the definitions of validity proposed and explored by Lee Cronbach (Cronbach & Meehl, 1955) and Samuel Messick (1980). For language testing in particular, there were proposals to consider the social dimension, influenced by the anti-Chomskian linguistics of Hymes (1972), in an article on designing tests for the study of bilingualism in the New Jersey Barrio (Cooper, 1968) and in a paper discussing the social bias of using English tests as a qualification for admission of foreign students to US universities (Spolsky, 1967). The social dimension was also stressed in a number of papers dealing with ethics in language testing (Fulcher, 1999; Hamp-Lyons, 1997; Shohamy, 1997; Spolsky, 1981, 1984, 1997; Stansfield, 1993), which resulted in the writing and publication of ethical standards by the language testing profession (International Language Testing Association, 2000). But as McNamara recognizes (McNamara & Roever, 2006), the breakthrough in language testing theory was the work of Lyle Bachman (1990), who introduced the Hymes model, as it had been explained in a pioneering article by Canale & Swain (1980), in considering the social context by recognizing language use situations.


Tim McNamara’s work has had a fundamental impact on language testing. This volume brings together over 20 leading scholars in language assessment whose work has been influenced by Tim McNamara. Their papers cover issues of the social impact of language tests, such as fairness and justice of test use and language testing in the context of migration. They also address testing of interaction, and teachers’ and students’ views of language tests. The volume concludes with papers discussing the future of language testing in the face of contested concepts of validity, the rise of social media, and lingua franca language use.


