Specialisation and Variation in Language Corpora

by Ana Díaz-Negrillo (Volume editor), Francisco Javier Díaz-Pérez (Volume editor)
©2015 Edited Collection VIII, 346 Pages
Series: Linguistic Insights, Volume 179


Corpus linguistics was initiated with the compilation and exploitation of native English reference corpora. Over the past years, it has expanded and specialised to such an extent that a variety of languages, registers, text types and speakers are now represented in language corpora. This volume gives evidence of that expansion. It focuses on emerging types of corpora and corpus techniques, and also presents corpus-based studies in areas which have benefited from recent developments in corpus linguistics methods and techniques, including foreign language teaching, language acquisition, translation and terminology, dialectology, lexicography and language variation. The volume comprises 11 papers on technical aspects of corpus data processing, on corpus-based linguistic research, and on emerging corpora. It is structured in three main sections, one for each of these three aspects.

Table Of Contents

  • Cover
  • Title
  • Copyright
  • About the Editors
  • About the Book
  • This eBook can be cited
  • Contents
  • Acknowledgements
  • Trends in corpus specialisation: Ana Díaz-Negrillo, Francisco Javier Díaz-Pérez
  • 1. Introduction
  • 2. Corpus-data processing
  • 3. Corpus-based linguistic analysis
  • 4. Further emerging specialised corpora
  • 5. Conclusion
  • References
  • Section 1: Corpus data treatment
  • Stand-off annotation in learner corpora: compiling the Greek Learner Corpus (GLC): Alexandros Tantos, Despina Papadopoulou
  • Abstract
  • 1. Introduction
  • 2. Presentation of the project
  • 2.1. Error annotation and analysis cycle
  • 2.2. Error annotation scheme
  • 2.3. Error annotation challenges
  • 3. Stand-off annotation in GLC
  • 4. Conclusions
  • References
  • Appendix
  • AixOx, a multi-layered learners’ corpus: automatic annotation: Sophie Herment, Anne Tortel, Brigitte Bigi, Daniel Hirst, Anastassia Loukina
  • Abstract
  • 1. Introduction
  • 2. Literature review
  • 3. The AixOx corpus
  • 3.1. Compilation
  • 3.2. Corpus size
  • 4. Annotation
  • 4.1. SPPAS
  • 4.1.1. Inter-pausal unit segmentation
  • 4.1.2. Word and phoneme segmentations
  • 4.1.3. Syllabification
  • 4.1.4. Enriched orthographic transcription
  • 4.2. MOMEL and INTSINT
  • 5. A possible pedagogical application: the example of questions
  • 5.1. Yes-no questions
  • 5.2. WH-questions
  • 6. Conclusion and perspectives
  • Acknowledgements
  • References
  • Appendix 1
  • Fiche de renseignements / Information sheet
  • Appendix 2
  • Consentement éclairé
  • Consent form
  • Conjunctive relations across languages, registers and modes: semi-automatic extraction and annotation: Ekaterina Lapshinova-Koltunski, Kerstin Kunz
  • Abstract
  • 1. Introduction
  • 2. Theoretical background
  • 2.1. Cohesive conjunctions vs. other cohesive devices
  • 2.2. Conceptualisation and classes
  • 3. Resources and tools to analyse conjunctive relations
  • 3.1. Corpus resources
  • 3.2. Annotation and extraction tools
  • 3.3. Annotation procedures
  • 4. Querying and analysing conjunctive relations
  • 5. Conclusion
  • Acknowledgements
  • References
  • Corpus strategies for multimodal text analysis in knowledge-based terminological bases: Juan Antonio Prieto Velasco
  • Abstract
  • 1. Introduction
  • 2. Theoretical and methodological underpinnings
  • 3. From corpora to terminological knowledge bases
  • 4. Exploring specialized knowledge representations through corpus strategies in EcoLexicon
  • 4.1. Corpus compilation
  • 4.2. Corpus annotation
  • 4.2.1. Conceptual tags
  • 4.2.2. Tags for visual components
  • 5. Corpus processing: methodology
  • 4.3. Elaboration of frequency wordlists
  • 4.4. Generation of concordance lines
  • 4.5. Knowledge-rich visual contexts (KRVCs)
  • 4.6. Elaboration of a conceptual and visual template
  • 5. Conclusions and further research
  • Acknowledgements
  • References
  • Section 2: Corpus-based linguistic analysis
  • Prosodic variation in the Basque language: intonational areas: Gotzon Aurrekoetxea, Iñaki Gaminde, Leire Gandarias, Aitor Iglesias
  • Abstract
  • 1. Introduction
  • 2. The EDAK corpus
  • 3. Data analysis
  • 4. Sociolinguistic analysis of the data
  • 5. Geolinguistic analysis of the data
  • 5.1. Geolinguistic analysis of the data from the younger generation
  • 5.2. Geolinguistic analysis of the adult generation data
  • 5.3. Geolinguistic variation and classification of the varieties
  • 6. Conclusions
  • Acknowledgements
  • References
  • Indicators of lexical growth throughout age, genre and modality for a Catalan L1 corpus: Laia Cutillas Alberich, Liliana Tolchinsky, Elisa Rosado, Joan Perera
  • Abstract
  • 1. Introduction
  • 2. Corpus compilation
  • 2.1. Informants
  • 2.2. Tasks
  • 2.3. Procedure
  • 2.4. Corpus storage
  • 2.4.1. Clean version (net)
  • 2.4.2. Morphologically-tagged version (morfo)
  • 2.4.3. Other versions
  • 2.5. Corpus processing
  • 3. Dimensions of analysis
  • 3.1. Lexical diversity
  • 3.2. Lexical density
  • 3.3. Word length
  • 3.4. Productivity of verbs
  • 4. Results
  • 4.1. Quantitative description of the corpus
  • 4.2. Lexical diversity
  • 4.3. Lexical density
  • 4.4. Word length
  • 4.5. Productivity of verbs
  • 4.5.1. Diversity of verbal lemmas
  • 4.5.2. Productive use of inflectional morphology of verbs
  • 5. Conclusion
  • Acknowledgements
  • References
  • Strategies of persuasion in a 16th century Hungarian remedy book: Ágnes Kuna
  • Abstract
  • 1. Introduction
  • 2. The corpus
  • 2.1. Ars Medica in the medical literary tradition of the 16th century
  • 3. Persuasion in the medical recipes
  • 3.1. The schema of medical recipes
  • 4. Persuasion and positive attitude in Ars Medica
  • 4.1. General Positive Attitude (2080/894)
  • 4.2. Testedness (2080/133)
  • 4.3. Certainty (2080/69)
  • 4.4. The Time Factor (2080/179)
  • 4.5. Result/The Removal Of Illness (2080/649)
  • 4.6. Authenticity – The Source of Persuasion (2080/179)
  • 4.7. The frequency of categories elaborating persuasion in Ars Medica
  • 5. Summary and conclusion
  • Acknowledgments
  • References
  • Section 3: Further emerging specialised corpora
  • Corpus design and exploitation for translation purposes: ENEUPECOR a French-Spanish bilingual corpus on neuromuscular diseases in paediatrics: María Magdalena Vila Barbosa
  • Abstract
  • 1. Introduction
  • 2. Corpus building
  • 2.1. ENEUPECOR: a corpus on neuromuscular diseases on paediatrics
  • 2.2. The corpus of study: compilation criteria, composition, size and statistics
  • 2.2.1. Sub-corpus in Spanish
  • 2.2.2. Sub-corpus in French
  • 3. Corpus processing and analysis
  • 3.1. Analysis of the most frequently-occurring tokens
  • 3.2. Polylexical lists: clusters and collocates analysis
  • 4. ECODE (definitional contexts extractor)
  • 4.1. Evaluation of the results of ECODE analysis
  • 5. Development of the online glossary
  • 6. Analysis of thematic chains and their markers
  • 7. Final remarks
  • Acknowledgments
  • References
  • ‘Understanding Science’ – A German popular science corpus: Uli Held / Karin Maksymski
  • Abstract
  • 1. ‘Understanding Science’ – objectives and challenges of a corpus-based analysis
  • 2. Overall corpus design
  • 3. The main corpus
  • 3.1. General considerations as to the design
  • 3.2. Criteria for construction of the main corpus
  • 3.2.1. Register, text type and medium
  • 3.2.2. Topics and Domains
  • 3.2.3. Source media
  • 3.2.4. Time span and authors
  • 3.2.5. Length and format
  • 4. Annotation
  • 4.1. Processing and workflow
  • 4.2. Word formation and complexity
  • 4.3. Vocabulary
  • 4.4. Morpho-syntactic categories
  • 4.5. Syntactic Functions and Dependencies
  • 4.6. Speech
  • 4.7. Rhetorical structure
  • 4.8. Discourse strategies
  • 4.9. Macro structure
  • 5. Conclusion and outlook
  • Acknowledgements
  • References
  • Le Dauphin project: a micro-corpus of correspondence in Lapurdian Basque of 1757: Manuel Padilla-Moyano
  • Abstract
  • 1. Introduction
  • 2. Contextualizing Le Dauphin corpus
  • 2.1. The discovery of Le Dauphin
  • 2.2. Linguistic context
  • 3. Correspondents’ typology, epistolary uses and subject matters
  • 4. Linguistic interest
  • 5. Towards the design of a micro-corpus
  • 5.1. Le Dauphin among other corpora
  • 5.2. A micro-corpus susceptible of extension
  • 5.3. Digital edition of Le Dauphin corpus
  • 5.4. Tagging and annotation by means of the TEI system
  • 6. Conclusion
  • Acknowledgements
  • References
  • An author dictionary on Attila József’s oeuvre: Attila Mártonfi
  • Abstract
  • 1. Introduction
  • 1.1. The genre of author dictionaries
  • 1.2. Attila József
  • 1.3. Problems of Hungarian morphology
  • 2. Main characterization of the author dictionary on Attila József’s œuvre
  • 2.1. Problems with alternative versions
  • 2.2. The appendices
  • 3. The entries
  • 3.1. The structure of the entries
  • 3.2. Head of the entries
  • 3.3. End of the entries
  • 4. Conclusion
  • Acknowledgements
  • References
  • Notes on Contributors
  • Index


We gratefully acknowledge the financial support of the Andalusian Regional Government of Spain (Consejería de Economía, Innovación, Ciencia y Empleo, 2/2011), which helped cover the organizational costs of the 4th International Conference on Corpus Linguistics (22nd-24th March 2011, University of Jaén, Spain), as well as the publication of this book, which contains selected papers from the above-mentioned conference.

We are also grateful to a number of colleagues who acted as peer-reviewers in the selection process.

The Editors
Jaén, June 2013


Trends in corpus specialisation


Computerised corpus linguistics began around the 1960s with the compilation and exploitation of the first reference corpus of the English language. Over 50 years later, reference corpora are probably the largest and most consolidated corpus type. They are also perhaps the corpus type that reaches the largest number of users, as they are used by both specialists and non-specialists in linguistics, so much so that they are increasingly regarded as another reference tool for language use.

While English reference corpora have become consolidated, corpus linguistics and language corpora have also undergone remarkable expansion and specialisation. The continuous growth of corpus linguistics has been fostered by the interest of users in language-related areas, who have realized how powerful a tool corpora can be in their disciplines. Nowadays, a large number of language corpora exist for an extensive variety of languages. National corpora of major languages are available, as well as corpora of languages spoken by smaller communities. Corpora currently also cover a range of registers, text types and subject fields. Indeed, the increasing specialisation in corpus linguistics has made it possible to investigate a variety of linguistic aspects for a range of applications in areas such as language teaching, second language acquisition, translation, terminology, stylistics and discourse analysis.

Progress in the design and implementation of data processing tools has also played a crucial role in the development of corpus linguistics. Originally, most of these tools were designed for written, native, non-specialised corpus data, and some were later transferred to specialised corpora. However, specific tools became necessary to suit the particular features of the language or text type in question, and to cater for specialised corpus data and the methodological approaches required in a variety of specific domains.

This volume is further evidence of the extraordinary expansion that corpus linguistics and language corpora have gone through over the past years. It focuses on emerging corpus types, corpus techniques and corpus-based linguistic studies in areas that can now be researched as a result of the recent development of corpus linguistics, and which were presented at the 4th International Conference on Corpus Linguistics, ‘Language, Corpora and Applications: diversity and change’ (22–24 March 2012, University of Jaén, Spain). In so doing, this volume also intends to support work in corpus linguistics, which may lead to the initiation, development or consolidation of new approaches to the design, processing and analysis of language corpora.

In order to give evidence of the expansion of language corpora, the volume covers small and specific corpora, both as to the languages represented in the corpora (Basque, Greek, Catalan, Hungarian), and the type of corpora they represent (learner corpora, translation corpora, correspondence corpora and technical –medical– corpora). Specifically, it comprises papers on technical aspects of corpus data processing (section 1), on corpus-based linguistic research (section 2), and on emerging corpora (section 3), all of which will be discussed, in this order, in the rest of this chapter.

2. Corpus-data processing

The first section in the volume deals with procedural issues that are central in corpus linguistics: corpus annotation in written and oral corpora, automatic identification and extraction of linguistic items, and corpus multimodality. The section is mainly concerned with types of corpora that are associated with areas of applied linguistics (learner and translation corpora) and which, due to their nature, require special corpus analysis techniques.

TANTOS/PAPADOPOULOU and HERMENT ET AL. give evidence of the development of learner corpora in recent years. Learner corpora began to be compiled in the 1990s as collections of written non-native language to be used for pedagogical and SLA research purposes (Granger 2002). In terms of corpus annotation, and due to their language-specific features, learner corpora have largely relied on manual annotation, specifically error and interlanguage annotation, and on tools which were designed for native corpus data, like POS tagging (Granger et al. 2009; for an overview, see Díaz-Negrillo/Thompson 2013). In recent years, however, the identification of learner-specific features in corpus data has started to become at least partially automated, and the annotation procedures have also become more sophisticated.

In this volume, TANTOS/PAPADOPOULOU look at error annotation of a corpus of learner Greek: the Greek Learner Corpus (GLC). Their work stands among the first initiatives to compile a corpus of Greek learner language (cf. also Tzimokas 2010) and, in particular, it stands out for the formalisation of its error annotation in a multi-layered fashion. The tagset has been designed following the hierarchical structure of error categories originally proposed by Dagneaux et al. (1996) and is implemented in the corpus using UAM CorpusTool (O’Donnell 2008), software which stores multi-layered corpus annotations. This freeware is used nowadays for a variety of manual annotation types. Two of its outstanding features are that it requires no programming expertise on the part of the user and that it is rather versatile, as it also contains a tool for statistical analysis. Finally, the paper explains the annotation standard used for the corpus. While the TEI has been widely used for format standardisation purposes in language corpora (cf., however, also TUSNELDA), the GLC uses the stand-off annotation strategies of LAF (Ide/Romary 2004). This, as the authors explain, is an improvement with respect to other formats that do not support an underlying annotation model.
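The stand-off principle mentioned above can be illustrated with a minimal Python sketch. It is not the GLC format, the LAF serialisation or the UAM CorpusTool data model: the `Annotation` class, the helper `spans_for`, the sample sentence and the labels are all hypothetical and serve only to show the idea that annotations live apart from the primary text and point into it by character offsets, so that several layers can cover the same span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    layer: str  # annotation layer, e.g. "error" or "pos"
    label: str  # hypothetical tag, not taken from the GLC tagset


def spans_for(text, annotations, layer):
    """Return (covered text, label) pairs for one annotation layer."""
    ordered = sorted(annotations, key=lambda a: a.start)
    return [(text[a.start:a.end], a.label) for a in ordered if a.layer == layer]


# The primary text is never modified; annotations only point into it.
text = "She go to school yesterday"
annotations = [
    Annotation(4, 6, "error", "verb-tense"),
    Annotation(4, 6, "pos", "VERB"),
    Annotation(17, 26, "pos", "ADV"),
]

error_spans = spans_for(text, annotations, "error")
pos_spans = spans_for(text, annotations, "pos")
print(error_spans)
print(pos_spans)
```

Because each layer is only a list of pointers, a new layer can be added or an existing one revised without touching the learner text or the other layers.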

While the pioneering learner corpora were mostly written, which is also the case of the GLC, oral learner corpora started to be collected once corpus design and development had reached a certain degree of standardisation and the technical means available allowed easier processing. A number of oral learner corpora of various non-native languages are well known today, like FLLOC (French), SPLLOC (Spanish) and CYLIL (various languages). Some corpora also have written and spoken counterparts, like ICLE (written) and LINDSEI (oral), or MICASE (oral) and MICUSP (written). Some corpora have been specifically designed for the study of pronunciation, like LeaP (German), ANGLISH (English) or AixOx (English and French), described in this volume in HERMENT ET AL.’s paper. The authors of this paper are actively engaged in the development of oral corpora and techniques for the phonetic analysis of learner corpus data. AixOx is described in this volume as a multilingual learner oral corpus containing both native and non-native texts in English and French. A multilingual design allows for multiple comparisons across the various components in the corpus and, as a result, has evident potential for corpus-based SLA studies. The paper describes the tools used for automatic multi-layered annotation of the AixOx corpus: in particular, SPPAS, for automatic alignment of speech recording and phonetic transcription of speech, MOMEL, an algorithm for phonetic representation of intonation patterns, and INTSINT, another algorithm for surface phonological representation of intonation patterns. Finally, the paper illustrates the applicability of the corpus and tools to foreign language pedagogy by presenting a pilot study on the intonation of English yes/no and wh-questions.

LAPSHINOVA-KOLTUNSKI/KUNZ is the third paper in this first section on corpus data processing. It looks at semi-automatic detection of conjunctive devices in a multilingual and multi-register corpus. The corpus in question is GECCo (Kunz/Lapshinova-Koltunski 2011). It is a multilingual corpus as it comprises subcorpora in English and German, and it is multi-register as it also contains written and oral data, original and translated, and academic and non-academic. A major strength of the corpus is that the types of texts it contains and its design open up research possibilities in the fields of comparative linguistics and translation. The purpose of the paper is, however, technical, as it aims to implement semi-automatic extraction of a variety of cohesive devices across the subcorpora. For such purposes, the authors annotate the corpus at various levels, including word and sentence levels. These annotations are later queried using the corpus query processor CQP (Evert 2005) so as to arrive at semi-automatic annotation and extraction of cohesive devices in the corpus. The annotations of the conjunctive devices are later checked and revised manually using the annotation tool MMAX (Müller/Strube 2006).

The last paper in this section deals with techniques in multimodal corpora. In so doing, the paper presents a further corpus type that is gaining ground in corpus linguistics, the multimodal corpus, and an applied area, language translation, where corpora are used more and more frequently as a basic research tool and as a tool for professional purposes. PRIETO VELASCO’s paper shows that multimodal corpora can stand as a powerful resource for terminological and terminographic purposes. Specifically, Prieto Velasco discusses the crucial role of multimodality within the framework of Frame-based Terminology theory (Faber et al. 2005), and describes the corpus-based methodological steps taken in the elaboration of an entry, using the database EcoLexicon as an example. These methodological steps include the selection of terms from language corpora using frequency analysis of the multimodal corpora, the annotation of visual data in the corpus according to a number of cognitive and semiotic properties, the elaboration of the definitions from the information obtained from corpus concordances, the identification of images according to previously defined properties, and the selection of relevant linguistic contexts.
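Two of the steps listed above, frequency analysis and the generation of concordance lines, can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the EcoLexicon workflow: the functions `tokenize`, `frequency_list` and `concordance` are hypothetical helpers, the tokenisation is deliberately naive, and the sample text is invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of letters (naive tokenisation)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_list(tokens, top=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(top)


def concordance(tokens, node, window=3):
    """Return (left context, node, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Invented sample text, used only to exercise the helpers.
text = ("Erosion removes sediment from the coast. Coastal erosion is "
        "driven by waves, and erosion rates vary.")
tokens = tokenize(text)
top = frequency_list(tokens, 1)
kwic = concordance(tokens, "erosion", window=2)

print(top)
for left, node, right in kwic:
    print(f"{left:>15} [{node}] {right}")
```

In a real terminological project, a candidate term surfaced by the frequency list would then be inspected through its concordance lines for contexts that help draft a definition.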

3. Corpus-based linguistic analysis

The second section of the volume illustrates corpus-based research into language. It comprises three papers, which deal respectively with prosodic dialectological issues, based on an oral corpus of Basque dialects (Aurrekoetxea et al.), lexical development, based on a developmental written corpus of Catalan (Cutillas et al.), and communicative strategies, based on a 16th century medical corpus in Hungarian (Kuna).

AURREKOETXEA ET AL. study the intonation patterns of the different dialects of Basque. With its focus on intonation, the paper is particularly relevant as it contributes to an area which has barely been explored, even for the standard variety of the Basque language. In addition, it is the first approach to the prosodic rules of the Basque dialects carried out so far. For the purposes of the paper, the authors use a preliminary version of an oral corpus of Basque dialects, the EDAK corpus (EUskal DIAlektologia), and explore the various intonation patterns in the production of yes/no questions, wh-questions and statements. Finally, and most interestingly, with the aid of dialectometric techniques and synthetic cartography, the authors present a series of maps which show the characterisation of dialects arrived at in the study.

CUTILLAS ET AL. explore lexical development in a corpus of Catalan. The main objectives of the paper are to explore lexical development in speakers of Catalan from childhood to adulthood, and also to evaluate measures that are used to quantify such development and identify which of them may stand as valid tools for such purposes. The paper first describes the corpus GRERLI-CAT1 (Grup de Recerca per a l’Estudi del Repertori Lingüístic, Català L1), which comprises narrative and expository spoken and written texts in Catalan produced by 79 bilingual Catalan/Spanish speakers. Then, it goes on to apply and evaluate the following parameters of lexical development: lexical diversity, as an indicator of vocabulary richness, lexical density, as an indicator of textual and information richness, word length, as an indicator of lexical complexity, and verb productivity, as a subcategory of lexical richness within the lexical category of verbs.
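Three of the four parameters named above can be given a rough computational reading, sketched here in Python. The operationalisations are assumptions and may differ from those used in the chapter: lexical diversity is taken as a plain type-token ratio, lexical density as the share of tokens outside a tiny hypothetical function-word list (`FUNCTION_WORDS`), and word length as mean characters per token. Verb productivity is left out because it requires morphological tagging. The English sample sentence is invented.

```python
import re

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; a real study would rely on part-of-speech tagging).
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "on", "is"}


def tokenize(text):
    """Lower-case the text and keep runs of letters (naive tokenisation)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lexical_measures(text):
    """Compute simple lexical indicators for one text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        # type-token ratio: distinct word forms / running words
        "diversity": len(set(tokens)) / len(tokens),
        # share of content words among all running words
        "density": len(content) / len(tokens),
        # mean number of characters per running word
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
    }


sample = "the cat sat on the mat and the dog sat"
measures = lexical_measures(sample)
print(measures)
```

A plain type-token ratio is sensitive to text length, which is one reason why comparing measures of this kind across age groups, genres and modalities calls for careful evaluation.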

KUNA’s is the last paper in this section. It investigates the strategies of persuasion in the Ars Medica corpus. This corpus is the digital version of the earliest surviving medical book written in Hungarian, which bears the same title as the corpus. From a functional-cognitive perspective, the author sets out to explore the linguistic representations that activate persuasion in this remedy book, and the conceptual categories that intervene in this communicative function. Corpus-based procedures enable the author to distinguish and characterise the various patterns in these acts of persuasion and to weigh up their relative frequency.

4. Further emerging specialised corpora

The third section of the book is devoted to the presentation of projects which aim at the development of specialised corpora. Vila Barbosa describes a parallel corpus compiled for terminological and translation purposes. HELD/MAKSYMSKI describe a corpus collected for investigation into popular science language features. Padilla-Moyano describes a correspondence corpus from the 18th century compiled to study language development in Lapurdian, a dialect of the Basque language. Finally, Mártonfi describes the corpus used for lexicographic purposes and the necessary steps for the compilation of an author dictionary.

VILA BARBOSA analyzes the possibilities offered by comparable corpora to terminology and translation practice. Her chapter describes the compilation of the ENEUPECOR corpus, a French-Spanish bilingual comparable corpus made up of scientific papers on neuromuscular diseases in paediatrics, as well as the results of a terminological analysis of that corpus carried out to produce a bilingual corpus-based glossary. Vila Barbosa also explains the exploitation tasks carried out to analyse the thematic chains in the texts included in the corpus. Her initial hypothesis is that such an analysis would make it possible to identify the thematic areas which present a more significant terminological and referential development. The results of this study have a clear applicability to fields such as terminology or translation practice. As the author states, the immediate usefulness of the exploitation of the ENEUPECOR corpus is the elaboration of a French-Spanish online glossary. The corpus is part of a wider research project consisting of the creation of French-Spanish bilingual corpora which could be used by scientific and technical translators working in the area of neuromuscular diseases. According to Vila Barbosa, both the thematic sub-domain (neuromuscular diseases in paediatrics) and the selection of languages (French and Spanish) respond to a real translation demand.


Pages: VIII, 346
Publication date: 2014 (October)
Keywords: teaching, acquisition, translation, dialectology, corpus linguistics
Published: Bern, Berlin, Bruxelles, Frankfurt am Main, New York, Oxford, Wien, 2014. 346 pp.

Biographical notes

Ana Díaz-Negrillo is a lecturer in English Linguistics at the English and German Department of the University of Granada, Spain. She specialises in learner corpus research and English morphology. Francisco Javier Díaz-Pérez is a lecturer in Linguistics and Translation at the English Department of the University of Jaén, Spain. He specialises in Pragmatics and Translation Studies.

