Studies in Learner Corpus Linguistics

Research and Applications for Foreign Language Teaching and Assessment

by Erik Castello (Volume editor), Katherine Ackerley (Volume editor), Francesca Coccetta (Volume editor)
©2016 Edited Collection 358 Pages
Series: Linguistic Insights, Volume 190


This volume explores the potential of using both cross-sectional and longitudinal learner corpora to investigate the interlanguage of learners with various L1 backgrounds and to subsequently apply the findings to language teaching and assessment. It is made up of 18 chapters selected from papers presented at the international conference «Compiling and Using Learner Corpora», held in May 2013 at the University of Padua, Italy. The chapters discuss current issues and future developments of the use of learner corpora, present case studies based on teaching and assessment experiences in various contexts, and report on longitudinal corpus-based studies conducted within the Longitudinal Database of Learner English (LONGDALE) project. Other chapters report on investigations of specific aspects of the interlanguage of a variety of learner populations, and the last ones address issues of corpus compilation and representativeness. The majority of the contributions draw on data produced by EFL learners from Germany, Italy, Japan, Spain, and the Netherlands, while others concern learners of Italian and Spanish as Foreign Languages.

Table Of Contents

  • Cover
  • Title
  • Copyright
  • About the author
  • About the book
  • This eBook can be cited
  • Contents
  • Introduction
  • Section 1: Learner Corpora for Language Teaching and Assessment
  • Using Learner Corpora in Language Testing and Assessment: Current Practice and Future Challenges
  • Dealing with Errors in Learner Corpora to Describe, Teach and Assess EFL Writing: Focus on Article Use
  • Using Learner Corpora to Order Linguistic Structures in Terms of Apparent Difficulty
  • Focus on Form in Computer-Mediated Communication: Using Written Learner Data to Foster Language and Pragmatic Skills in Communicative Contexts
  • The Compilation and Use of a CMC Learner Corpus for Japanese University Students
  • Section 2: Longitudinal Learner Corpus-based Studies
  • Introduction to the LONGDALE Project
  • Nouns and Noun Phrases in Advanced Dutch EFL Writing: From Quantitative to Qualitative Longitudinal Data Analysis
  • I didn’t really *understood what it was about, but it really *made fun: A Longitudinal Corpus-based Study of Tense/Aspect and High-frequency Verbs in Learner English
  • Assessing Advanced EFL Students’ Proficiency at Producing Affect-laden Discourse
  • Towards a Longitudinal Study of Metadiscourse in EFL Academic Writing: Focus on Italian Learners’ Use of it-extraposition
  • Short-term Effects of Students’ Exploration of Corpora: A Longitudinal Study of Pre- and Post-modification of Noun Phrases in Learner English
  • Section 3: Language Corpora for the Analysis of Interlanguage
  • Analysing the Language of Interpersonal Relations in Corpora of Elicited Learner and Native Speaker Interactions in English
  • From Learner to Expert: Using a Corpus to Analyse the Use of Must by German Advanced Students of English
  • Spanish Copulas and the Interlanguage of Iraqi University Students
  • Phraseology in Academic L2 Discourse: The Use of Multi-words Units in a CMC University Context
  • Section 4: Learner Corpus Compilation and Representativeness
  • Representing Learner English in a Specialized Corpus: Genre and Proficiency Level in the Advanced Learner Corpus of Argumentative Student Essays (ALCASE)
  • Connecting Data Elicitation and Pedagogical Practice in Learner Corpus Design: The Case of TILCE – the Turin Italian Learner Corpus of English
  • A Generic Data Workflow for Building Annotated Text Corpora
  • Notes on Contributors



This volume presents studies based on the compilation and analysis of learner corpora, that is, electronic collections of authentic, continuous and contextualised foreign or second language texts produced by learners and assembled according to explicit design criteria (Granger 2009: 14). Learner Corpus Research (LCR) has developed considerably over the last twenty years, as testified by the many learner corpora that have been compiled all over the world, by the foundation of a specific international academic association, the Learner Corpus Association, as well as by the many conferences, publications and, recently, a journal specifically devoted to this sub-field of corpus linguistics (e.g. Granger/Gilquin/Meunier 2013, Callies/Götz 2015, Callies/Paquot 2015). Such research has been conducted by scholars in various disciplines, including Second Language Acquisition (SLA), Language Testing and Assessment (LTA) and Foreign Language Teaching (FLT), with a wide variety of aims, for example, to target the needs of specific groups of language learners and evaluate their performance more precisely. In spite of the many advancements, however, there are some theoretical and methodological aspects of this field of research which are still in need of investigation, especially in view of the successful application of the findings to the target contexts. As claimed by Granger (2009: 28),

we need more learner corpora – particularly longitudinal ones – representing a much wider range of genres, tasks and learners in a wider range of languages. Secondly, it is time to start thinking seriously of a standardized markup and annotation system and a purpose-built architecture for storing, annotating and searching learner corpora […]. Thirdly, there is a need for both thorough analyses of learner data based on solid theoretical underpinnings and to design pedagogical tools which meet the realities of the teaching and learning context.

This volume takes up some of these issues, in an attempt to explore the potential of using both cross-sectional and longitudinal learner corpora to investigate the interlanguage of learners with various L1 backgrounds, and to subsequently apply the findings to language teaching and assessment. Issues of corpus compilation and representativeness are also addressed.

The 18 chapters that make up the volume originated in papers presented at the international conference “Compiling and Using Learner Corpora”, held in May 2013 at the University of Padua, Italy. They reflect the variety and range of topics and approaches that were discussed during the conference, and are presented in four separate sections: learner corpora for language teaching and assessment; longitudinal learner corpus-based studies; language corpora for interlanguage analysis; learner corpus compilation and representativeness. The majority of the contributions are based on data produced by EFL learners from Germany, Italy, Japan, Spain, and the Netherlands, while others concern learners of Italian and Spanish as Foreign Languages.

Section 1 has a particular focus on the use of learner corpora for language teaching and assessment. The section opens with a chapter by MARCUS CALLIES, who discusses the current practices, benefits and challenges of using learner corpora for testing and assessing second/foreign language proficiency. He surveys practical applications of corpora in LTA which range from corpus-informed to corpus-based and corpus-driven approaches. Subsequently, he discusses some major methodological issues in LCR which pertain to learner corpus compilation and analysis, and their implications for LTA. He then exemplifies how learner corpora can be used to increase transparency, consistency and comparability in the assessment of second/foreign language writing proficiency in a data-driven approach that is partially independent of human rating.

In the second chapter, MARÍA BELÉN DÍEZ-BEDMAR examines the role played by errors in the analysis of learner language in LCR, focusing on article use by Spanish EFL learners. To do so, she first highlights the use of learners’ errors in some Complexity, Accuracy and Fluency (CAF) measures employed in SLA studies, and in LCR methodologies either in isolation, in Computer-aided Error Analysis (CEA), or complemented with the correct uses of the language, in an Interlanguage Analysis (IA), in Contrastive Interlanguage Analysis (CIA) and in the Integrated Contrastive Model (ICM). This chapter presents the phases involved in this challenging task and pays special attention to problematic aspects at each stage. It closes with an exemplification of the use of LCR methodologies to describe the use of the article system by Spanish learners of English at different institutional statuses and proficiency levels.

The remaining three chapters in Section 1 exemplify how learner-corpus data can be used to inform curriculum design and teaching practices. In his contribution, MICK O’DONNELL explores how a learner corpus can be used to distribute linguistic concepts over a language teaching curriculum, and thus identify where particular lexical and grammatical phenomena should be taught in rising proficiency levels. He puts forward an approach which does not attempt to place grammatical concepts at given proficiency levels, but rather works to order them according to acquisitional difficulty. Tense-Aspect usage in a corpus of essays written by Spanish EFL learners is taken as an example. He discusses how the information gleaned from this research can be used to re-plan teaching curricula.

MARTA GUARDA provides an overview of how a learner-corpus approach was integrated into an English as a Lingua Franca (ELF) Computer-Mediated Communication (CMC) exchange, involving English language students from Italy and Austria. By means of Web-based tools (e.g. Skype, wikis, Facebook) the students engaged in online collaborative work which aimed to enhance their language skills and foster intercultural awareness. A learner corpus was created from the written data derived from the students’ interactive tasks. The Italian students then explored this and were encouraged to reflect on linguistic and pragmatic aspects of their interlanguage (e.g. choice of lexis; agreement expressions). This chapter highlights the potential benefits of the use of learner data produced in CMC activities to raise pragmatic and linguistic awareness.

TIM MARCHAND and SUMIE AKUTSU describe a course for university students in Japan which uses CMC for the dual purpose of providing lesson materials online and collecting students’ written production to develop a learner corpus. The chapter outlines the course setup, explains how the corpus was constructed, and examines some of the preliminary results of their research into student output. The findings of the corpus analysis demonstrate where the learner corpus and a similar native-speaker corpus are convergent and divergent, especially with reference to the use of the personal pronouns I and you. The chapter concludes by highlighting some of the pedagogical and motivational benefits of using CMC and a CMC-derived corpus in language teaching.

Section 2 moves on to a series of longitudinal studies conducted within the Longitudinal Database of Learner English (LONGDALE) project, coordinated by Fanny Meunier of the Université catholique de Louvain. After a brief introduction to the project by FANNY MEUNIER, PIETER DE HAAN reports on a study of the development of syntactic control in EFL. He starts from the consideration that formal academic writing is characterised by a nominal rather than a verbal style, implying that more mature writers are likely to use more noun phrases (NPs). At the same time, formal writing tends to be more complex, suggesting that the NPs produced by more mature writers are also more complex. He analyses the writings of two Dutch first-year university students of English with comparable average scores on grammar placement tests and grammar exams. Data was collected from these two students on two occasions with a three-month interval. The analysis suggests that they are beginning to develop into more mature writers. However, they are also shown to behave differently from each other, suggesting that one of them developed at a greater speed than the other. The comparison of the writings of these students suggests that there can be a wide gap between (passive) EFL knowledge and (active) EFL control.

In the following chapter, CAROLINE GERCKENS and ANNE GANS investigate tense and aspect usage and the phraseology of high-frequency verbs in the written performance of intermediate/advanced German students of English. Their data comes from the German branch of the LONGDALE project, and their main focus is on student progress under different instruction conditions over a one-year period. They investigate whether the students’ output displays a significant difference in error frequency in these two areas, and what the pedagogical implications may be for language instructors. The preliminary findings seem to indicate that variation in the results stems from differences in text types.

PASCALE GOUTÉRAUX explores affect-based acquisition and the ability to express attitudinal stance and convey feelings in speech by using rich, accurate and fluent language. She analyses the results of an experiment carried out with second and third-year French university students reacting to aesthetic objects as part of the French component of the LONGDALE project, and assesses the evolution of their spoken skills over two years. The participants reacted spontaneously to four anonymised works of art, scaled them in terms of valence and discussed the aesthetic or personal experiences they elicited with an English assistant. The analysis highlights the role of task specificity as an independent variable and of personal relatedness and proficiency as dependent variables in modulating the richness of affect speech. While most productions display a wider range of linguistic markers than expected, the main differentiating features between more and less competent users are the mastery of affect-related complex syntactic structures and metaphorical imagery. These results support the increased use of native and non-native corpora to develop both the awareness and the production of authentic appraisal discourse.

ERIK CASTELLO explores the use of it-extraposition constructions in the Italian component of the LONGDALE project, and specifically in a sub-corpus made up of reports and argumentative essays written by Italian undergraduate university students in the second and third year respectively. The results show that the use of it-extraposition increases over the two years, and that the extraposed embedded clauses (e.g. for/to clauses) and the adjectives and past participles employed in them are more varied in the third year. Furthermore, the data seems to suggest that the mistakes made by the learners are mainly due to the unsuccessful combination of it-extraposition with other features of academic English (e.g. the passive voice, post-modification of noun phrases). Finally, the analysis of the rhetorical aims of the constructions reveals that the learners use them mainly to emphasise their claims, yet in the third year they also perform other functions through them, which to a large extent is likely to be due to the differences between the types of texts written in the two years.

In the final chapter of this section on longitudinal corpora, KATHERINE ACKERLEY investigates Italian students’ use of the noun phrase in online self-presentations both before and after their exploration of a native speaker corpus of the same text type. In line with other contributions to this volume, the two learner corpora of self-presentations explored here are formed from texts produced for pedagogical tasks. In this chapter the second learner corpus is examined for evidence of language development following data-driven learning tasks based on the native speaker corpus. Katherine Ackerley notes that students’ awareness of how to structure information in a noun phrase increased, though some lack an awareness of the order of information and the appropriateness of limiting the amount of information conveyed in one noun phrase. Although the text type focused on here is informal, she also discusses the impact these types of task can have on students’ academic writing.

While all the chapters in this volume analyse learner interlanguage, those in Section 3 focus explicitly on specific linguistic aspects of the interlanguage of Italian and German learners of English as well as on those of learners of Spanish and Italian as foreign languages. FRANCESCA COCCETTA and SILVIA SAMIOLO discuss the results of two strands of investigation of two corpora of elicited speech: a corpus of interactions between EFL Italian students and a corpus of interactions in English between native speakers. The two main areas of analysis are the speech function of command and the use of modal verbs. Both types of investigation adopt a Systemic-Functional-Linguistic approach to language, compare the linguistic repertoires of the learners and of the native speakers in the realization of interpersonal relationships, and attempt to reveal differences and/or similarities in the ways they interact. The analyses were prompted by the observation that differences between learners and native speakers in the realizations of the speech function of command and in the use of modal verbs are often not only connected to ‘correct’ or ‘incorrect’ grammar, but also to cultural differences which often bring about specific effects in the construction and maintenance of interpersonal relationships. A deeper understanding of these differences should improve learners’ awareness of and sensitivity to the different ways face is negotiated in the target language.

CONOR GEISELBRECHTINGER explores how a large corpus of papers written by advanced students of English at a German university can be used to investigate the use of modal auxiliaries, and of the central modal must in particular, a notoriously difficult area for advanced learners of English. The teaching of modals in German universities has traditionally placed focus on grammatical errors, for example those concerning syntax or tense, or on German-English translation problems. The investigation presented in this chapter demonstrates that the problems students face go beyond this narrow framework into areas such as typological misuse, semantic ambivalence or inappropriate register. By applying a systematic characterisation of errors it shows that the majority of misuses do not lie in the area of grammar or German interference, but rather in making unsubstantiated inferential statements or claims based on an intrinsic or extrinsic authority, i.e. it is unclear whether the learners’ statements are meant subjectively or objectively. The discussion aims to lead to a refocusing on these areas of modality and their consequences on course design and language teaching.

In their chapter, VENTURA SALAZAR GARCÍA and ABBAS F. ELIWEY focus their attention on Iraqi learners’ use of the Spanish copular verbs ser and estar. It is well known that the existence of two copular verbs in Spanish grammar can cause errors in the production of students learning Spanish as a foreign language, even at a high proficiency level. While the existing literature on this topic is based on L1 English learners, this chapter aims to contribute to a better understanding of the role played by the opposition ser vs. estar in the interlanguage of Arabic-speaking learners of Spanish. In order to do so, the authors investigate a corpus of essays written by students learning Spanish at the University of Baghdad, and take into account not only wrong cases and omissions, but also correct occurrences. They find that their errors are mainly intralinguistic, i.e. Spanish-specific, that errors attributable to native language influence are few and of minor importance, and that the extension of estar causes more errors than the extension of ser. They thus conclude that estar is especially problematic in Arabic-speaking learner interlanguage.


ISBN (Softcover)
Publication date: 2016 (January)
Keywords: corpus linguistics, cross-sectional learner corpora, language teaching and assessment, corpus compilation and representativeness, EFL, longitudinal learner corpora, error analysis, learner corpora
Bern, Berlin, Bruxelles, Frankfurt am Main, New York, Oxford, Wien, 2015. 358 pp.

Biographical notes

Erik Castello and Katherine Ackerley are tenured researchers and lecturers in English language and linguistics at the University of Padua, Italy, while Francesca Coccetta is a researcher and lecturer in English language and linguistics at Ca’ Foscari University of Venice, Italy. Their research interests include corpus linguistics and computer and Internet technology for language teaching and assessment. They currently use computer learner corpora to inform both their research and teaching.

