Corpus Analysis for Descriptive and Pedagogical Purposes

ESP Perspectives

by Maurizio Gotti (Volume editor) Davide S. Giannoni (Volume editor)
©2014 Edited Collection 432 Pages
Series: Linguistic Insights, Volume 200


There is hardly any aspect of verbal communication that has not been investigated using the analytical tools developed by corpus linguists. This is especially true in the case of English, which commands a vast international research community, and corpora are becoming increasingly specialised, as they account for areas of language use shaped by specific sociolectal (register, genre, variety) and speaker (gender, profession, status) variables.
Corpus analysis is driven by a common interest in ‘linguistic evidence’, viewed as a source of insights into language phenomena or of lexical, semantic and contrastive data for subsequent applications. Among the latter, pedagogical settings are highly prominent, as corpora can be used to monitor classroom output, raise learner awareness and inform teaching materials.
The eighteen chapters in this volume focus on contexts where English is employed by specialists in the professions or academia and debate some of the challenges arising from the complex relationship between linguistic theory, data-mining tools and statistical methods.

Table Of Contents

  • Cover
  • Title
  • Copyright
  • About the Editors
  • About the Book
  • This eBook can be cited
  • Contents
  • Introduction: Maurizio Gotti / Davide S. Giannoni
  • 1. Corpus analysis and specialised discourse
  • 1.1. Language description
  • 1.2. Pedagogical applications
  • 2. Contents of the volume
  • 2.1. Methodological issues
  • 2.2. Corpus-based descriptions
  • 2.3. Pedagogical applications
  • 3. Closing remarks
  • Methodological Issues
  • Which Unit for Linguistic Analysis of ESP Corpora of Written Text?: Lynne Flowerdew
  • 1. Introduction
  • 2. Units for linguistic analysis
  • 2.1. Frequency lists
  • 2.2. Keyword lists
  • 2.3. Lexical bundles
  • 2.4. Phrase frames
  • 2.5. Collocational frameworks
  • 2.6. Concgrams
  • 2.7. ‘Small words’
  • 2.8. Semantic sequences
  • 3. Other starting points for linguistic analysis
  • 4. Concluding remarks
  • Integrating Corpus and Genre Approaches: Phraseology and Voice across EAP Genres: Marina Bondi
  • 1. Introduction
  • 1.1. Pedagogic applications
  • 1.2. Research vs. popular genres: authorial voice and reader orientation
  • 2. Materials and Methods
  • 2.1. Choice of corpora
  • 2.2. Methods
  • 3. Analysis
  • 3.1. Use of you in popular writing
  • 3.2. Use of we: semantic sequences and functional patterns
  • 3.3. Authorial voice and modality
  • 3.4. Writer/reader identities
  • 4. Concluding remarks
  • Using Concgrams to Investigate Research Article Sections: Winnie Cheng
  • 1. Research on vocabulary and phraseology
  • 2. Corpus analyses of empirical research articles
  • 3. The present study
  • 4. Material and methods
  • 4.1. Corpus data
  • 4.2. Procedure
  • 5. Results
  • 6. Conclusion
  • Corpus Query Techniques for Investigating Citation in Student Assignments: Hilary Nesi
  • 1. Introduction
  • 2. Method
  • 3. Results and discussion
  • 4. Conclusion and ideas for the classroom
  • Researching Genres with Multilingual Corpora: A Conceptual Enquiry: Carmen Pérez-Llantada
  • 1. Research trends for genre analysis
  • 1.1. Discoursal features
  • 1.2. Academic genre types
  • 2. The phraseological profile of specialized genres
  • 3. The challenge of academic Englishes
  • 4. A note on pedagogy
  • Corpus-Based Descriptions
  • The Expression of Stance in Nurse-Patient Interactions: An ESP Perspective: Shelley Staples / Douglas Biber
  • 1. Introduction
  • 2. Methods
  • 2.1. The corpora
  • 2.2. Stance features
  • 2.3. Data analysis
  • 3. Overall trends for stance features across conversation and medical discourse
  • 3.1. Modals
  • 3.2. Stance adverbs
  • 3.3. Stance complement clauses controlled by verbs
  • 3.3.1. Stance verb + that-clauses
  • 3.3.2. Stance verb + to-clauses
  • 3.3.3. Stance complement clauses controlled by adjectives and nouns
  • 4. Conclusion
  • The Marking of Importance in ‘Enlightentainment’ Talks: Alan Partington
  • 1. Corpus design, use and aims
  • 2. Marking importance
  • 3. Types of importance-markers
  • 3.1. Concordancing lexical items and simple word patterns
  • 3.2. Necessity
  • 3.3. Personal relevance
  • 3.4. Big numbers
  • 4. The role of importance-marking in discourse organisation
  • 4.1. Wh-clefts
  • 4.2. Scale and importance-marking as an argument-framing device
  • 5. Importance-marking and good-bad evaluation
  • 6. Conclusion
  • Appendix
  • Investigating Blawgs through Corpus Linguistics: Issues of Generic Integrity: Giuliana Garzone
  • 1. Introduction
  • 1.1. The weblog as a genre
  • 2. Method
  • 2.1. Compiling a web-derived corpus: the case of blogs
  • 2.2. Corpus description
  • 3. Analysis: individualistic/existential elements
  • 3.1. Pronominal reference
  • 3.2. Self-mention and theme
  • 3.3. Lexical verbs with first-person singular pronouns
  • 4. Conclusions
  • Women’s Authorial Voice: Discursive Practices in Scientific Prefaces: Begoña Crespo
  • 1. Personal involvement in scientific writing
  • 2. Prefaces as a genre
  • 3. Material and methodology
  • 4. Results
  • 4.1. Frequency of selected features
  • 4.2. Features by century
  • 4.3. Features by genre
  • 4.4. Features by discipline
  • 5. Concluding remarks
  • Appendix
  • Prefaces considered in this study
  • Abstraction as a Means of Expressing Reality: Women Writing Science in Late Modern English: Isabel Moskowich / Leida Maria Monaco
  • 1. Introduction
  • 1.1. Women scientists in the Late Modern Period
  • 1.2. Abstraction in scientific discourse
  • 2. Corpus material
  • 3. Analysis of data
  • 3.1. Variation across discipline
  • 3.2. Variation across genres
  • 3.3. Conjuncts
  • 3.4. Passive constructions
  • 3.4.1. Agentless passives
  • 3.4.2. By-passives
  • 3.5. Adverbial subordinators
  • 4. Final remarks
  • Appendix
  • List of texts in corpus
  • Newsroom Jargon at the Crossroads of Corpus Linguistics and Lexicography: Roberta Facchinetti
  • 1. Introduction
  • 2. Newsroom jargon and lexicography
  • 3. Corpus construction
  • 4. Data analysis
  • 5. The glossary
  • 6. Conclusions
  • Appendix: Specimen of a glossary entry
  • Exploring Political and Banking Language for Institutional Purposes: Rita Salvi
  • 1. Introduction
  • 2. Corpus analysis and ESP
  • 3. Institutional language in the field of economics
  • 4. Methodology
  • 5. What the data says
  • 5.1. Keywords
  • 5.2. From keywords to aboutness
  • 5.3. Domains
  • 6. A corpus-based discourse analysis
  • 6.1. Cultural keywords
  • 7. Discourse organization
  • 8. Evaluation and rhetoric
  • 9. Final remarks
  • Appendix
  • Source texts
  • Family in the UK – Risks, Threats and Dangers: A Modern Diachronic Corpus-assisted Study across Two Genres: Jane H. Johnson
  • 1. Introduction
  • 2. Methodology
  • 3. Results
  • 3.1. Keywords with RISK*
  • 3.1.1. In NEWS 1993
  • 3.1.2. In NEWS 2005+
  • 3.2. Keywords with *DANGER*
  • 3.2.1. In NEWS 1993
  • 3.2.2. In NEWS 2005+
  • 3.3. Keywords with THREAT*
  • 3.3.1. In NEWS 1993
  • 3.3.2. In NEWS 2005+
  • 3.4. A synchronic comparison (risk, danger, threat)
  • 3.4.1. In NEWS 1993
  • 3.4.2. In NEWS 2005+
  • 4. Comparisons with the Sociology corpus
  • 5. Concluding remarks
  • Pedagogical Applications
  • Corpus Linguistics and Vocabulary Teaching: Perspectives from English for Specific Purposes: Averil Coxhead
  • 1. Introduction
  • 2. Specialised vocabulary and why it is important
  • 3. Using corpus linguistics to identify specialised vocabulary
  • 3.1. Word lists
  • 3.2. Multi-word units
  • 3.3. Vocabulary load and size
  • 4. How ESP practitioners can use corpora
  • 5. Challenges of corpus-based resources in ESP classrooms
  • 6. Some future directions for research and teaching
  • A ‘Speedful Development’: Academic Literacy in Chinese Learners of English as a Foreign Language: Cassi L. Liardét
  • 1. Introduction
  • 2. SFL and grammatical metaphor
  • 3. An integrated methodology
  • 4. Experiential GM: Framework of analysis
  • 5. Findings
  • 5.1. Patterns of reliance
  • 5.2. Incomplete and non-word reconstruals
  • 5.3. Infelicitous pluralisation
  • 5.4. Anaphoric reconstrual
  • 6. Conclusion
  • Variation in Academic Writing: Complexity, Pronouns, Modals and Linking in South African MA Theses: Josef Schmied
  • 1. Introduction
  • 2. The ZAMA Corpus
  • 3. Results
  • 3.1. Variation in complexity
  • 3.2. Personal pronouns
  • 3.3. Modal auxiliaries
  • 3.4. Linking texts explicitly
  • 4. Pedagogical applications
  • Formulaic Language in Economics Papers: Comparing Novice and Published Writing: Turo Hiltunen / Martti Mäkinen
  • 1. Introduction
  • 2. Formulaic language and the AFL
  • 3. Material and methods
  • 3.1. Data
  • 3.2. Method of analysis
  • 4. Results
  • 4.1. Frequencies of formulas across corpora
  • 4.2. Extent of variation
  • 5. Discussion and conclusions
  • Acknowledgements
  • Hands On: Developing Language Awareness through Corpus Investigation: Gillian Mansfield
  • 1. Introduction
  • 2. Corpora for language teachers and learners
  • 3. Working with corpora – some sample activities
  • 3.1. China Daily corpus
  • 3.2. Boat corpus
  • 3.3. Lipstick Names corpus
  • 3.4. Detective Novel Titles corpus
  • 3.5. Medical Translation corpus
  • 4. Conclusions
  • References
  • Notes on Contributors



1. Corpus analysis and specialised discourse

The study of language use through documentary evidence gleaned from variously large collections of authentic texts pre-dates by centuries the modern science of corpus linguistics. Since Samuel Johnson’s landmark Dictionary of the English Language (1755), lexicographers and the reading public have become aware that in language matters intuition is not enough, for the actual meaning/usage of words varies over time, from place to place and contextually. Driven by a similar interest, medieval scholars pioneered the first Bible Concordances (Schenker 2003), documenting the frequency and semantic range of root words in Scripture. Similar concordances were compiled after the advent of print from the works of literary classics such as Chaucer, Shakespeare and Milton, to name but a few.

Despite these early examples, the realisation that language description should always be corroborated by textual evidence is a relatively new development in linguistic research, with the main thrust coming from computational linguistic techniques in the 1950s, and the subsequent appearance of electronic computing machines (cf. Sinclair et al. 1970). The spread of personal computers in the late 1980s, combined with the inception of online media in the 1990s, has revolutionised the field in two major directions:

  • widespread accessibility of huge amounts of data with no agreed guidelines for its collection, storage or processing;
  • a dramatic shift from manual analysis to automatic data mining, based on dedicated software applications and increasingly complex statistical tools.

Such considerations explain many of the challenges faced to this day by corpus linguistics and corpus-assisted research. All the technology in the world cannot conceal the fact that language output is a human construct and its interpretation involves a degree of subjectivity, whatever the methodology employed. As Sinclair aptly admits, “we both trust our intuitions and keep a wary eye on the strong possibility of misunderstanding what we are observing” (2004: 44).

1.1. Language description

One of the common challenges analysts experience is how to relate theory and description. Should the corpus be approached as a source of ‘evidence’, capable of proving or disproving the researcher’s assumptions, or as an authority in its own right, whose ‘observation’ is a source of meaningful insights? This dichotomy is traditionally reflected in the terms corpus-based and corpus-driven (Tognini-Bonelli 2001), although the distinction has become fuzzy in practice. Sinclair himself advocates a cross-fertilisation between linguistic description and theory (cf. Herbst et al. 2011), capable of mapping out not only how a language is used but also how it can be used in a given context. Significantly, the ponderous Routledge Handbook of Corpus Linguistics (O’Keeffe/McCarthy 2010) has a chapter on the investigation of ‘creativity’, i.e. those idiosyncratic, unpredictable elements of communication whose range and occurrence are also constrained.

Even the purest quantitative analysis implies a purpose. Corpora are living creatures that come into being and are queried in view of some kind of benefit, which may either be direct (insights into phenomena) or indirect (gathering lexical, semantic and contrastive data for various applications). It is customary and useful, however, to distinguish between research occasioned mainly by an interest in description, and the latter type, more concerned with application. This distinction is reflected in the title given to the present volume.

The easy availability of digital texts, both online and offline, has encouraged researchers to branch out, specialising in areas of language use shaped by an array of sociolectal variables (register, genre, variety) and speaker variables (gender, profession, status). Each of these contexts can in turn be investigated synchronically or diachronically, in spoken or written discourse, and across media. The contributions presented below focus on ESP settings, that is, on contexts where English is used by specialists (in the professions or academia) to communicate with their peers or in asymmetrical encounters. Because of its role as an international working language in so many fields, English is viewed here beyond the narrow confines of L1 speaker normativity.

1.2. Pedagogical applications

It is difficult to overstate the relevance of corpus analysis for language teaching. In the case of English, the focus has inevitably been on its largest constituency, i.e. EFL teachers/learners, but L1 classrooms are equally set to gain from such insights. This does not mean, of course, that data-driven teaching/learning is always adequate or effective (for a recent critical assessment, see Boulton/Tyne 2013). Possible pedagogic applications (Aijmer 2009) include: creating/using learner corpora to identify and monitor output; accessing online corpora in the classroom to raise awareness of language issues; and informing the production of teaching materials. Although the number of microlinguistic aspects amenable to investigation is endless, corpora are particularly useful for teaching the phraseology of English and its translation from/into other languages.

Corpus-based approaches have been criticised for not taking sufficient account of contextual features and pragmatic considerations, focusing instead on discrete ‘atomised’ textual units. While much of the attention in the literature has concentrated on patterns of lexical co-occurrence, it is often advisable to combine these insights with other variables, both within and outside the text, also through tagging and annotation (cf. Herbst et al. 2011). As Flowerdew (2005) points out, however, the analyst is often also a specialist informant in the case of ESP research, and as such can contribute insights drawn from his/her direct knowledge of the target discourse and its community of practice.

Striking the right balance between pedagogic ingenuity and methodological rigour is an open challenge for analysts and teachers alike. The technicalities of corpus linguistics (for example, Gries 2013 on measuring ‘collocation strength’) can admittedly discourage practitioners from venturing into uncharted territory. Linguists themselves are warned against getting carried away by technology rather than engaging critically with texts. A decade ago, Sinclair (2004: 54) pointed out that “most of the research projects in corpus linguistics that are in progress at the present time are not examining their languages at all, but are examining the tags”.

It is useful to bear in mind the distinction between reference corpora built for dissemination and public use, which need to be accessible over time across a range of platforms for different types of analysis, and small disposable corpora, assembled by individual researchers/practitioners for one-off investigations of specific features. A large proportion of pedagogic approaches rely on purpose-built material belonging to the latter category (cf. Ghadessy et al. 2001; Braun 2007). It is also true, however, that large general corpora may turn out to be relatively small, if we single out specific sub-sections or speaker variables.

2. Contents of the volume

Many of the points raised above are discussed in the chapters of this volume. Its contributions have been loosely grouped into three sections, according to their topic and analytical focus. However, some of the chapters will be found to straddle these neat labels, producing a certain degree of overlap between sections.

2.1. Methodological issues

The first section examines some general questions concerning the relationship between the field of corpus linguistics and that of specialised discourse. One of these questions is the choice of unit to investigate in a specialised corpus. This is the topic examined by LYNNE FLOWERDEW. Her discussion of several different corpus-based research studies of specialised text shows that they have different starting points, relying on different units of analysis. She notes that the vast majority of research commences from a bottom-up perspective in which lexis or some kind of lexico-grammatical unit is taken as the starting point for analysis, moving towards a more top-down discourse-oriented approach. Others, instead, adopt a top-down approach, by means of which the functional components of a genre are determined first, and then all the texts in a corpus are analysed in terms of these components. Many of the studies investigated, however, mediate between bottom-up and top-down approaches. Another finding of Flowerdew’s investigation is that the type of unit taken for linguistic analysis is often determined by the software used, which highlights the great complexity of the relationship between linguistics, software tools and statistical methods. The consideration that the majority of the corpus-based studies conducted in ESP are phraseological – and therefore primarily syntagmatic in nature – leads the author to point out that innovative developments in this field could profit from software programs which can capture also the paradigmatic relations of text.

MARINA BONDI investigates the role played by corpus and genre in approaches to English for Academic Purposes over the past 20 years and underlines how the interplay between the two notions, far from leading to contradictory methods, has proved extremely fruitful both from a descriptive and from a pedagogic point of view. Indeed, the integration of tools that can be related to the two notions has provided excellent means for the analysis of language variation across genres, cultures and disciplines. She then discusses the challenges and opportunities of combining the two approaches, as well as ways of integrating them through a study of language variation across research genres and popular genres in the discipline of history. The results of her analysis confirm that an exploration of phraseological patterns may be useful for identifying both discourse-specific and genre-specific elements, in particular different forms of self-mention and reader engagement.

WINNIE CHENG investigates the use of corpora to facilitate the analysis of specialised phraseology in English, in order to identify the phraseological tendency whereby particular words are co-selected by language users when they speak and write. To address this issue, she adopts a specially-designed computer-based methodology known as ‘concgramming’, based on the phraseological search engine ConcGram 1.0 (Greaves 2009). As a specific field of study she chose research articles (RAs), and the material for her analysis is drawn from the Hong Kong Corpus of Research Articles. By discussing similarities and divergences in the most frequent two-word concgrams across different RA sections, this study identifies RA section-specific word co-occurrences versus section-generic ones. Cheng’s analysis shows that in the Discussion, Literature Review and Abstract sections of RAs, words co-selected by authors tend to be generic, regardless of the disciplines to which the authors belong. Instead, authors tend to co-select discipline-specific words more frequently than generic ones when writing all the other sections of RAs.
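To make the notion of a two-word concgram more concrete, the following is a minimal Python sketch of counting word pairs that co-occur within a short span, whatever their order and whatever words intervene. It is an illustration only and does not reproduce ConcGram's own procedure; the function name `two_word_concgrams`, the `span` parameter and the sample sentence are all assumptions made for this example.

```python
from collections import Counter


def two_word_concgrams(tokens, span=3):
    """Count unordered word pairs co-occurring within `span` tokens.

    Hypothetical helper: pairs are stored as frozensets, so that
    'play ... role' and 'role ... play' are counted together.
    """
    counts = Counter()
    for i, origin in enumerate(tokens):
        # Look only to the right; a leftward match is found when the
        # other word takes its turn as the origin.
        for other in tokens[i + 1 : i + 1 + span]:
            if other != origin:
                counts[frozenset((origin, other))] += 1
    return counts


# Invented sample text, not taken from any corpus discussed in the volume.
tokens = "they play a key role and the role they play".split()
counts = two_word_concgrams(tokens, span=3)
print(counts[frozenset(("play", "role"))])
```

Storing each pair as an unordered set is what lets the count cover both positional variation (either word first) and constituency variation (intervening words), the two kinds of variation that concgram searches are designed to capture.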

Using texts drawn from the British Academic Written English (BAWE) corpus, HILARY NESI examines the forms of in-text citation employed by students, with comparisons across various disciplines. The corpus query techniques used in this study are explained in full so that future researchers can apply the same queries for themselves, perhaps to investigate specific disciplines or genres in the BAWE corpus, or to investigate citation forms in other similarly annotated corpora. The study also examines the reporting verbs occurring in integral citations identified through corpus queries. The findings indicate that, despite many similarities in citation practices, university students producing assessed coursework do not refer to sources in quite the same way as research students and writers of research articles. There also seem to be very big differences in the citation practices of different disciplines, probably reflecting differences in the genres students are required to produce.
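As a rough illustration of what a citation query can look like, the Python sketch below separates integral citations (author named in the sentence, year in brackets) from non-integral ones (the whole reference in brackets) using regular expressions over plain text. It is not the query set described in Nesi's chapter, which relies on the annotation of the BAWE corpus; the patterns, the function name `classify_citations` and the sample sentences are assumptions, and the patterns cover only simple author-date forms.

```python
import re

# Assumed shape of an author slot: a capitalised surname, optionally
# followed by "et al." or by "and"/"&" plus a second surname.
AUTHOR = r"[A-Z][\w'’-]+(?: et al\.| (?:and|&) [A-Z][\w'’-]+)?"

# Integral: "Swales (1990) argues ..."
INTEGRAL = re.compile(r"\b(" + AUTHOR + r") \((\d{4})[^)]*\)")

# Non-integral: "... (Hyland 2000)" or "... (Biber et al., 1999)"
NON_INTEGRAL = re.compile(r"\((" + AUTHOR + r"),? (\d{4})[^)]*\)")


def classify_citations(text):
    """Return (author, year) pairs found for each citation type."""
    return {
        "integral": INTEGRAL.findall(text),
        "non_integral": NON_INTEGRAL.findall(text),
    }


# Invented sample sentences for demonstration purposes.
sample = (
    "Swales (1990) argues that genres evolve. "
    "Others disagree (Hyland 2000). See also (Biber et al., 1999)."
)
result = classify_citations(sample)
print(result["integral"])
print(result["non_integral"])
```

Patterns of this kind will miss multiple references inside one pair of brackets, numeric citation styles and surnames beginning with a lower-case particle, which is one reason why queries over an annotated corpus are generally more reliable than pattern matching on raw text.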

CARMEN PÉREZ-LLANTADA critically reviews the main research trends used to analyse genres by means of multilingual corpora to determine how the latter can best help EAP researchers identify genre features across cultures and languages. Over the past decade, some academic corpora (such as those compiled within the KIAP, CADIS and SERAC projects) have provided rich insights into cross-cultural differences in the use of academic English discourse; these usefully complement the results of analyses based on monolingual corpora representative of academic genres such as MICASE, BASE, MICUSP and BAWE. The main findings of these projects demonstrate that it is no longer appropriate to maintain a monolingual/monocultural perspective when examining this functional variety of English, because cross-cultural academic communication often involves substantial formal and functional variation across non-native English users.

2.2. Corpus-based descriptions

The second section of the volume consists of chapters presenting analyses of selected linguistic features, carried out with the use of relevant corpora of specialised discourse. SHELLEY STAPLES and DOUGLAS BIBER’s study targets the use of grammatical stance devices in medical settings, focusing in particular on their presence in a corpus of nurse-patient interactions. The findings highlight important differences in the use of stance by nurses and patients in comparison with speakers in general conversation. Indeed, nurses use more stance features, which allow them to communicate functions unique to the medical encounter. Moreover, these stance features are strongly influenced by the asymmetric nature of nurse-patient interactions, as nurses use them consistently to manage the encounter. On the other hand, patients generally employ fewer stance devices than nurses or speakers in general conversation. This is due to the fact that patients have fewer opportunities to express their personal feelings, attitudes and value judgments, as their role is to provide information to the nurse but not to comment on this information or to discuss possibilities or predictions.


ISBN (Softcover)
Publication date: 2014 (March)
Keywords: verbal communication, linguistic evidence, language phenomena
Bern, Berlin, Bruxelles, Frankfurt am Main, New York, Oxford, Wien, 2014. 432 pp., 50 ill.

Biographical notes

Maurizio Gotti (Volume editor) Davide S. Giannoni (Volume editor)

Maurizio Gotti is Professor of English Language and Translation, Head of the Department of Foreign Languages, Literatures and Communication, and Director of the Research Centre for LSP Research (CERLIS) at the University of Bergamo. His main research areas are the features and origins of specialized discourse. Davide S. Giannoni, PhD, is Associate Professor of English Language and Linguistics at the University of Bergamo. His research on academic and professional genres has appeared in several international journals. With Peter Lang he has published Mapping Academic Values in the Disciplines: A Corpus-Based Approach (2010).


448 pages