Corpus-based studies on language varieties

by Francisco Alonso Almeida (Volume editor), Laura Cruz García (Volume editor), Víctor González-Ruiz (Volume editor)
This volume brings together a number of corpus-based studies dealing with language varieties. The contributions reflect contemporary lines of research, including language teaching and learning, translation, domain-specific grammatical and textual phenomena, and linguistic variation and gender, among others. The corpora used in these studies range from highly specialized texts, including earlier scientific texts, to regional varieties. Under the umbrella of corpus linguistics, scholars also apply other distinct methodological approaches to their data in order to offer new insights into old and new topics in linguistics and applied linguistics. Another important contribution of this book lies in the didactic implications of the results obtained in the individual chapters for domain-based language teaching.

The use of corpus linguistics (CL) methodology is rapidly expanding, as seen in the number of studies recently published in collections, e.g. Gather (2014), Kerremans (2015), Leńko-Szymańska and Boulton (2015). This published material encompasses new ways of looking at data, both historical and present-day, and offers fresh insight into current linguistic issues. This is possible because research carried out on large amounts of data may yield more reliable and accurate conclusions. In this context, CL stands as an ideal tool to detect and assess variation, and even change, in different varieties of language.

This volume is an example of the interaction between CL and language varieties. It contains ten papers, which focus on the analysis of language in different communicative and professional settings, but also embrace research on diaphasic and diatopic variation. These studies reflect contemporary lines of research, and include language teaching and learning, translation, domain-specific grammatical and textual phenomena, and linguistic variation and gender, among others, as described in the following paragraphs.

ASSUNTA CARUSO and ANTONIETTA FOLINO explore the use of corpus tools for terminology. Their paper describes the compilation of a specialized comparable corpus in the field of tourism. The corpus has enabled the authors to extract terms semi-automatically from the bilingual material in order to create a glossary and a thesaurus, which are useful for professional and general users with an interest in tourism.

PILAR LEÓN ARÁUZ and ARIANNE REIMERINK examine a corpus in the field of the environment. They concentrate on the relevance of corpus analysis as a tool for characterizing the multidimensional aspects of hyponymic structures in this compilation. They also describe how these structures can be represented in a terminological knowledge base (TKB). They conclude that, while retrieving hyponymic structures automatically from a given corpus saves time and is certainly productive, manual revision is still necessary to validate the results obtained.

LEJLA ZEJNILOVIĆ looks into verbo-nominal constructions in legal texts. In particular, she studies these constructions, which she describes as periphrastic expressions, in a corpus consisting of both English-language summaries of decisions by the European Court of Human Rights and their Serbian-language translations. The author points out that, though already identified as typical of legal texts, these expressions have not been studied in detail in terms of their structural and semantic specificities. With her chapter, she intends to contribute to a better characterization of these constructions, and to offer clues as to the translation of these periphrastic structures from English into Serbian.

MARÍA LUISA CARRIÓ PASTOR investigates the use of interactive writing in a corpus of research papers written by English speakers and Spanish speakers. The results of her enquiry show that there are clear differences in language use that are both domain-specific and cultural. The results of her study also have clear didactic implications for language teaching and learning.

KARIN AIJMER reports on the positions of ‘actually’ in some national varieties of English, as represented in the ICE-corpora. The author demonstrates the importance of position, namely left-periphery, right-periphery and medial, for the correct interpretation of ‘actually’. She also pinpoints differences in the position and function of this device that are clearly associated with specific varieties among those considered.

Through the analysis of a corpus of art museum audio descriptive guides, SILVIA SOLER GALLEGO examines the prototypical move structure of this genre from both a qualitative and a quantitative perspective. She seeks to determine the way this structure relates to its communicative dimension and its implications for visually impaired visitors’ access to museums.

Within the discipline of specialised translation, CAROLINE ROSSI, CÉCILE FRÉROT and ACHILLE FALAISE take up the gauntlet of providing French-English medical translation students with corpus-based tools to help them make decisions about how to translate noun phrases into English. Aware that this is a particularly challenging issue for French-speaking trainees, they rely on the pedagogical uses of corpora as an aid to the production of more idiomatic target language renditions of source texts. In doing so, they highlight the need to provide these data in a controlled environment, in this case in the form of a lightweight corpus query interface specifically designed to suit learners’ needs.

GEOFFREY S. KOBY describes a multidirectional parallel corpus of certification examination texts (from the American Translators Association certification examination), which provides a one-to-many alignment between one source and multiple target texts (in various language pairs) spanning a wide range of quality. As a first step in this project, the author describes the processing of the already existing handwritten exams; as a second step, he intends to design a system to capture and evaluate current data coming directly from the ongoing examination program.

IRIA BELLO VIRUEGA analyzes the use of lexical deverbal nominalizations in the Corpus of English Texts on Astronomy (CETA), taken from the Coruña Corpus, designed to contribute to the diachronic study of English. The author focuses specifically on those nominalizations that are formed by suffixation and indicate a process.

Finally, ISABEL MOSKOWICH’s paper addresses the relationship between language use and gender in a collection of Modern English texts included in the Coruña Corpus of English Scientific Writing. Her research reveals that women’s writing exhibits a great many interactional and other rhetorical devices in order to convey scientific thought, somewhat beyond traditional expectations concerning the frequencies of occurrence of certain metadiscursive strategies. Moskowich highlights reasons accounting for this linguistic overreaction, one of which could lie in women’s legitimate intent to occupy their place in the scientific sphere, which was clearly under male dominance.

We hope that this volume will contribute to the study of language varieties from a corpus linguistic approach. The papers touch on specific aspects of different discourse units and structures in varied (specialized) language settings, both written discourse and verbal interaction, to add new insight to current linguistic research. Another important aspect of this book is its didactic overtones, and, for this reason, it may be useful for language teaching, especially domain-based teaching. Much of the research presented in the papers included in this collection represents work in progress. We therefore look forward to seeing how this research progresses and, most importantly, to seeing how further work may follow from the contributions presented here.

1 Francisco Alonso, Laura Cruz and Víctor González are members of the Emerging Technology Applied to Language and Literature Research Group, a division of Instituto para el Desarrollo Tecnológico y la Innovación en las Comunicaciones (IDeTIC-ULPGC).

Corpus-based knowledge representation in specialized domains¹

1.  Introduction

The advantages of using corpus tools in terminological work are by now well established (Bowker 1996; Bowker and Pearson 2002; Meyer and Mackintosh 1996; Pearson 1998). In particular, the advantages of creating thesauri through a corpus-based approach include the possibility of extracting terms which are actually used in current written language according to evidence-based linguistic criteria. Indeed, criteria such as representativeness and balance, established by corpus linguistics as indexes of a well-constructed corpus, should be considered in this particular use of corpora, i.e. defining terminological resources that describe a specific domain in a comprehensive manner. Accordingly, the quality of the corpus could be measured a posteriori, by evaluating the quantity and the representativeness of the extracted terms. To this end, this paper presents the compilation of a specialized comparable corpus in the domain of tourism, followed by the construction of a controlled vocabulary, whose functions will be domain knowledge representation, terminological control, indexing and information retrieval.

The work described in this paper has been conducted within the framework of the project DiCeT-INMOTO-OR.C.HE.S.T.R.A², part of the “Programma Operativo Nazionale Ricerca e Competitività 2007–2013 – Smart Cities and Communities and Social Innovation”, funded by the Italian Ministero dell’Istruzione, dell’Università e della Ricerca (MIUR).

The following sections describe the stages of our approach in more detail, that is, corpus compilation, terminology extraction, and glossary and thesaurus construction.

2.  Corpus-based terminology: a brief overview

Corpus-based terminology is defined by Gamper and Stock (1998: 149) as “a working method which explores a collection of domain-specific language material (i.e. corpus) to investigate terminological issues”. While the use of machine-readable corpora has been well established in lexicography and in work on language for general purposes for some time, it has taken longer for corpus-based terminology to become an established procedure. Castagnoli (2006) posits that this could be due to the different nature of the corpora involved, which are large, general and easily reusable in the former case, and domain-specific, smaller and difficult to re-use in the latter.

According to Sager (1990: 133), a corpus-based approach to terminology opens up the possibility of gathering conceptual, linguistic and usage information about terminological units. Corpus linguistics tools and techniques can assist terminologists throughout the various stages of a terminology project: at the beginning, when the main issues are to identify term candidates and to provide evidence for and about them, as well as in the later stages of compiling definitions or selecting contextual examples (Kast-Aigner 2010). Bowker (1996: 30–31) argues that there are several advantages to using corpora in terminology. Firstly, machine-readable corpora make it possible for terminologists to increase both the speed and the scope of their research. In addition to allowing larger quantities of data to be processed more rapidly, thereby exposing terminologists to a larger number of conceptual descriptions, corpora also allow them to skip over the parts of a text that are insignificant from a terminological point of view and to focus on those parts which are terminologically interesting (Bowker 1996: 31–32). These latter parts are referred to by Meyer (2001: 281) as “knowledge-rich contexts”, which contain “at least one item of domain knowledge that could be useful for conceptual analysis”.
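The idea of skipping terminologically insignificant text and keeping only knowledge-rich contexts can be illustrated with a minimal sketch. It assumes that such contexts can be approximated by a handful of lexical cue phrases; the cue list, the function name `knowledge_rich_contexts` and the sample sentences are hypothetical and chosen for illustration only, and none of them is taken from the studies cited above.

```python
import re

# Hypothetical cue phrases, chosen for illustration only; a real project
# would derive and validate its own list for the domain at hand.
KNOWLEDGE_PATTERNS = [
    r"\bis a (?:kind|type) of\b",
    r"\bis defined as\b",
    r"\b(?:also )?known as\b",
    r"\bsuch as\b",
]
PATTERN = re.compile("|".join(KNOWLEDGE_PATTERNS), re.IGNORECASE)


def knowledge_rich_contexts(sentences):
    """Keep only the sentences that match at least one cue phrase."""
    return [s for s in sentences if PATTERN.search(s)]


# Invented sample sentences from an imaginary tourism corpus.
sentences = [
    "An agriturismo is a kind of farm-stay accommodation.",
    "We arrived late in the evening.",
    "Accommodation types such as hostels and B&Bs are listed.",
    "The weather was pleasant throughout.",
]
candidates = knowledge_rich_contexts(sentences)
for s in candidates:
    print(s)
```

A filter of this kind only narrows down what the terminologist reads; the retained sentences would still need manual assessment.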

Moreover, a machine-readable corpus facilitates the investigation of syntactic and semantic information along with linguistic patterns which are difficult to discover when manually scanning texts. This information can be retrieved by examining concordances, also referred to as ‘key words in context’ (KWIC), which present terms in a variety of different contexts and which reveal collocational information that may aid in understanding and using the terms more effectively (Bowker 1996: 32–33).
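As a minimal sketch of how a KWIC concordance can be produced, the following Python function lists every occurrence of a search word together with a fixed window of left and right context. The function name `kwic`, the character-based window and the sample text are assumptions made for illustration; dedicated concordancers typically work on tokens and offer sorting and filtering options that are not shown here.

```python
import re


def kwic(text, keyword, width=30):
    """Return one concordance line per occurrence of `keyword`.

    Each line shows up to `width` characters of left context, the
    matched word in square brackets, and up to `width` characters of
    right context. Matching is case-insensitive and whole-word.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up vertically.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# Invented sample text standing in for a tourism corpus.
sample = (
    "The agriturismo offers rooms on a working farm. "
    "Guests at an agriturismo often share meals with the hosts. "
    "Booking an Agriturismo in spring is advisable."
)
concordance = kwic(sample, "agriturismo")
for line in concordance:
    print(line)
```

Aligning the search word in a single column is what makes recurrent neighbouring words, and hence collocational patterns, easy to spot by eye.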


