Corpora and Language Change in Late Modern English
Summary
In the first part, the book provides an account of some available corpora for the study of Late Modern English, representing different text types such as medical English or private correspondence, among others. Additionally, these corpora cover various dialects and early new varieties of English.
In the second part, several corpus-based studies assess Late Modern English at different levels, shedding light on the language of the period.
Excerpt
Table Of Contents
- From Corpora to Data: Sources for the Study of Late Modern English
- Language Change in Ireland: Compiling and Using a Diachronic Corpus to Study the Evolution of an Early New English
- The Coruña Corpus of English Scientific Writing: The Gift that Keeps on Giving
- Medical English Writing in the Period 1700–1900: The Málaga Corpus of Late Modern English Scientific Prose
- The Corpus of Late Modern English Medical Writing: Scientific and Social Change in the Eighteenth Century
- Investigating Variation and Change in Late Modern English Dialects: The Salamanca Corpus
- Editing The Mary Hamilton Papers (c.1740–c.1850)
- Ridiculously Well or Madly Ambitious: Some Diachronic Notes on the Intensifying Adverbs Ridiculously and Madly
- Ephemeral Causal Adverbial Subordinators: Their Emergence and Decline in Modern English
- Demonstrative them in American English over Two Centuries (1820–2020)
- Tracking Down Marginal Productivity: The Suffix -ment between 1820 and 2019
- Past Participle Forms in Competition: -ed vs -(e)n in Historical British and American English
- Webster’s Spelling Reform: From -our to -or in Colour-Type Words
- Verbal Contractions in Late Modern English
- Amerindian Loanwords in Richard Hakluyt’s The Principall Navigations (1589) and Their Inclusion in Early and Late Modern English Dictionaries: Applications and Limitations of Digital Corpora, Databases and Tools in Lexicographical Research
- “My dearest friend … Ever Yours, Mary Hamilton”: Exploring Forms of Address in the Late Georgian Period
- Notes on Contributors
Javier Calle-Martín
From Corpora to Data: Sources for the Study of Late Modern English
1. Introduction
Scholars traditionally refer to the language used between 1500 and 1900 as Modern English and draw a fine distinction between Early Modern English (1500–1700) (henceforth EModE) and Late Modern English (1700–1900) (henceforth LModE). It is widely known that the most radical changes in English grammar had occurred in the Early Modern period, such as the incipient standardization of spelling, the Great Vowel Shift, the collapse of the inflectional system or the fixing of SVO order, among others, along with the huge number of foreign words incorporated into the lexicon. Even though LModE has been deemed a period of stability in the system (Romaine 1998: 7), a cursory look at some linguistic factors reveals that much was going on then. The rise of technological innovations and the expansion of the British empire all over the world, both under the shelter of the Industrial Revolution, led to a process of geographical and social mobility with tangible implications for the development of the language, inasmuch as the new social networks acted, in Milroy’s terms, as a “norm enforcement mechanism” (1987), subsequently hopping from one centre of population to the other (Milroy and Milroy 1985: 378; see also Tieken-Boon van Ostade 2009: 10). Letter-writing, as a typical instance of ego-writing, lent itself well to the endorsement of personal variation and was the ideal mode of transmission of some of these social innovations (Tieken-Boon van Ostade 2009: 39–41; Mugglestone 2006: 279–81).
The eighteenth and the nineteenth centuries thus witnessed the birth of modern sociolinguistics, which brought into play the social concept of language itself and, more importantly, the need for standardization to comply with the social requirements of the time, which primarily focused on differentiating new money from the old (Beal 2004: 94). These standards of correctness were in need of codification and it was in the LModE period that the precepts of the spelling reformers, grammarians and lexicographers became of paramount importance (Tieken-Boon van Ostade 2019: 8). These circumstances did not only favour the codification of British English. Within this trend of prescriptivism, the LModE period witnessed the creation of a distinctly American variety, pioneered by Noah Webster as the central actor in the creation of the American orthography (Upward and Davidson 2011: 302), whose norms soon spread among the early immigrants in search of linguistic correctness (Tottie 2002: 9).
A linguistic period such as LModE, characterized by the search for standardization and codification, while still giving room to personal and social variation, stands out as a challenge for linguists, who ought to bring some order out of the apparent chaos pertaining to the origin and driving force of particular linguistic features. It may become really hazardous to determine whether a given feature is just the result of the intrinsic development of the language, whether it responds to external motivations like the prescriptive bias of grammarians and lexicographers or whether it answers to any social, textual or idiosyncratic variation. It is at this point that corpora have come to play an outstanding role in language analysis and today we may congratulate ourselves on the release of medium-size and large corpora which, to a certain extent, satisfy many of the linguists’ needs.
The last decades have witnessed the proliferation of many diachronic corpora of LModE. Rather than aspiring to exhaustiveness, the following list is limited to the corpora deemed remarkable for their size, tagging and reputation in the discipline. Widely used are the Hansard Corpus (1803–2005) or the Eighteenth Century Collections Online (ECCO) (1700–1799) for the analysis of British English; and the Corpus of Historical American English (COHA) (1820–2019) or the Evans Early American Imprints corpus (Evans) (1639–1800) for the newly formed American variety in the period. The release of the Old Bailey Corpus also stands out as an added asset, housing c.134 million words from speech-related texts used in London’s Central Criminal Court between 1674 and 1913. In addition to these, other specialized corpora have proliferated which do not only illustrate usage over time, but also allow for the study of variation across particular genres, registers or text types, such as the latest version of A Representative Corpus of Historical English Registers (ARCHER) for the period 1600–1999; the Corpus of Historical English Law Reports covering 1535–1999; and Hendrik De Smet’s Corpus of Late Modern English Texts (CLMET) (1710–1920) with novels written by English and American authors. Among all these, Terttu Nevalainen’s The Corpus of Early English Correspondence Extension (CEECE) deserves special attention, containing more than two million words of letter-writing for the period 1680–1800, with a long-standing tradition and reputation for sociolinguistic research in the LModE period.
The present volume focuses on the leading role of corpora for the study of LModE, be it from a qualitative or a quantitative stance. For this purpose, the book has been divided into two parts. The first presents six of the currently available corpora for the study of the English language in the LModE period: The Corpus of Irish English Correspondence; The Coruña Corpus of English Scientific Writing; The Málaga Corpus of Late Modern English Scientific Prose; The Corpus of Late Modern English Medical Writing; The Salamanca Corpus: Digital Archive of English Dialect Texts; and The Mary Hamilton Papers. The second part, in turn, houses a selection of corpus-based empirical studies combining the use of current corpora with up-to-date methods of linguistic analysis at different linguistic levels, from spelling to morpho-syntactic and lexical issues of eighteenth- and nineteenth-century English.
2. The Primary Sources: Corpora
This first part begins with Amador-Moreno’s The Corpus of Irish English Correspondence (CORIECOR), a corpus especially conceived to provide insight into Irish English from a diachronic perspective. The corpus contains personal letters written by Irish emigrants and their recipients between the early eighteenth (1731) and the early twentieth centuries (1940), the period when Ireland became an English-speaking country (Amador-Moreno 2022: 49–56). In addition to the diachronic scope of the corpus, the compilers have been careful to select material from informants all over Ireland, a valuable asset to investigate the origin and development of linguistic features from the perspective of stylistic, regional and social variation, gender studies also included.
The corpus is freely available at the project’s webpage (https://corviz.h.uib.no/index.php), where users may find a general description of the project and access the letters themselves along with their bibliographic details, such as the year, the sender and recipient, their relationship, origin, occupation, gender and religion. More important, in our view, is the advanced search tool, which allows users to search for a particular word or string of words in the whole corpus – or a part of it – specifying, if necessary, a restricted set of conditions. CORIECOR has now become a primary source for linguistic research into Irish English not only synchronically and diachronically, but also from the perspective of variationist sociolinguistics. Although the corpus was only conceived in 2008 and took shape throughout the 2010s, the number of publications stemming from it stands as a solid indication of the maturity and suitability of the product for research.
CORIECOR is followed by Crespo and Moskowich’s The Coruña Corpus of English Scientific Writing (CC), an electronic corpus especially compiled for the study of eighteenth- and nineteenth-century scientific writing in English. In its current form, the corpus contains more than two million words, with approximately one million words each for the eighteenth and the nineteenth centuries. The great advantage of the corpus is the inclusion of sample material from different disciplines in the attempt to account for the particular writing traditions and restrictions of each discipline. Accordingly, the corpus is divided into different subcorpora based on UNESCO’s classification (1988), the list including CETA, Corpus of English Texts on Astronomy (2012); CEPhiT, Corpus of English Philosophy Texts (2016); CHET, Corpus of History English Texts (2019); CELiST, the Corpus of English Life Sciences Texts (2020); and CECheT, the Corpus of English Chemistry Texts (2022).
The compilers of the corpus have been careful in the selection of the material. When it comes to authors, both well-known and less famous figures have been included across the disciplines. As far as texts are concerned, edited and printed scientific prose has been given priority, favouring first editions whenever possible. The corpus samples have also been selected in accordance with other extralinguistic factors, such as (i) genre, relying on the text categories that existed in eighteenth- and nineteenth-century scientific texts (Görlach 2004); (ii) gender, giving room to women’s writing in spite of the evident minority of female authors at the time; (iii) geography, with details about the writers’ place of education for the analysis of variation within scientific discourse; and (iv) the age of authors in the attempt to investigate features from the perspective of the writers’ age groups. The project’s webpage hosts all the information about the corpus, the different subcorpora and the particularities of the tagged version of the product (https://www.udc.es/grupos/muste/corunacorpus/index.html). The corpus is also available in open access on the project’s webpage and on CQPweb at Lancaster University. The CC has had a long-standing trajectory since its inception in 2004 and has now become an essential source for the investigation of any linguistic feature in the period.
The volume continues with Calle-Martín’s description of The Málaga Corpus of Late Modern English Scientific Prose (CoLaMESP), the third component of The Málaga Corpus of Early English Scientific Prose. This corpus houses a collection of medical writing in the vernacular from the eighteenth and nineteenth centuries, the current version with 1.1 million words for the eighteenth century and another 1.2 million for the nineteenth. The corpus displays representative material from the three branches of medical writing in English, that is, theoretical treatises, surgical treatises and recipe collections. The product has been conceived as a primary source for linguistic research both over time and across text types. On the one hand, the corpus is organized into eight sub-periods of 25 years so that any linguistic feature may be surveyed from a diachronic standpoint. Textual variation, on the other hand, may be surveyed in light of the different rationale of medical writing, especially when it comes to theoretical/surgical treatises and recipe collections, the former as the most academic register written by surgeons and practitioners of the highest class and the latter as the language used by laypeople written by non-practitioners and barber surgeons.
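By way of illustration, the diachronic organization just described may be sketched in a few lines of Python. This is only a minimal sketch and not the compilers’ own procedure: the chapter states that there are eight sub-periods of 25 years, but the exact boundaries used below (1700–1724 through 1875–1899) and the names `subperiod` and `texts_per_subperiod` are assumptions made for the sake of the example.

```python
from collections import Counter

# Assumed boundaries: the source only states "eight sub-periods of 25 years".
START, END, SPAN = 1700, 1900, 25


def subperiod(year: int) -> str:
    """Return a label such as '1725-1749' for a publication year."""
    if not START <= year < END:
        raise ValueError(f"{year} falls outside the assumed range {START}-{END - 1}")
    lower = START + ((year - START) // SPAN) * SPAN
    return f"{lower}-{lower + SPAN - 1}"


def texts_per_subperiod(years):
    """Count how many texts fall into each 25-year sub-period."""
    return Counter(subperiod(y) for y in years)
```

Calling `texts_per_subperiod` on the publication years of a set of texts would give a quick view of how evenly the material is spread over time, which is the kind of check a diachronic survey usually starts from.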
The corpus is published and made available on the project’s webpage (https://latemodernmss.uma.es), offered in three different formats: the plain text version, the modernized version and the tagged version. The normalized version of the corpus is taken as the source data for the tagging of the corpus through an automatic process performed by CLAWS, a rule-based tagger that provides every running word with its corresponding POS-tag in view of its context. This tagged version consists of more than 160 different tags which, if properly combined, present a great potential for research on the levels of spelling and morphosyntax. The CQPweb version is also available for online use (https://latemodernmss.uma.es/cqpweb/).
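To give an idea of how a POS-tagged version of this kind may be exploited, the following Python sketch counts tag frequencies in a line of tagged text. It assumes the horizontal `word_TAG` layout that CLAWS can produce; whether the CoLaMESP files use exactly this layout is an assumption, and the sample sentence and the function names `parse_tagged` and `tag_frequencies` are purely illustrative.

```python
from collections import Counter


def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split a line of 'word_TAG' tokens into (word, tag) pairs.

    Tokens without an underscore are skipped as untagged.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


def tag_frequencies(line: str) -> Counter:
    """Count how often each POS tag occurs in a tagged line."""
    return Counter(tag for _, tag in parse_tagged(line))


# Invented sample sentence, assumed to be in horizontal word_TAG format.
sample = "The_AT patient_NN1 was_VBDZ bled_VVN in_II the_AT arm_NN1 ._."
frequencies = tag_frequencies(sample)
```

Combining tags in this way (for instance, a form of be followed by a past participle) is one plausible route to the morphosyntactic queries mentioned above, although CQPweb already offers such searches online.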
The Corpus of Late Modern English Medical Writing (LMEMT) is the third and final component of The Corpus of Early English Medical Writing, compiled by Prof. Taavitsainen’s VARIENG research team at the University of Helsinki (Finland). Released in the year 2019, this part contains eighteenth-century printed material covering a wide range of text types of medical writing, from learned treatises and journal articles to health guides and popular books on medicine. Periodicals are also given room as a new type of writing in vogue throughout the eighteenth century. The corpus is offered in two formats, the annotated version using TEI-compliant XML and the unannotated plain-text version, both of them sold with the accompanying book Late Modern English Medical Texts (Taavitsainen and Hiltunen 2019).
This release of the corpus incorporates over two million words drawn from 628 sample texts, with a limitation of 10,000-word extracts in the case of longer texts. Notwithstanding the size of the product, the extracts have been carefully selected to represent the full range of printed medical texts in the eighteenth century. The resulting corpus incorporates heterogeneous material inasmuch as “the authors and the audiences of the texts extend from lay people to educated elites” (Suhr, Taavitsainen and Hiltunen 2024: this volume). It therefore stands out as an ideal source for insight into common grammatical and discursive features as well as text structure in view of the length of some extracts. If used in combination with the other components of The Corpus of Early English Medical Writing, the Helsinki product turns into an excellent primary source for diachronic research in English. As a matter of fact, Prof. Taavitsainen’s outstanding project has been the source of a plethora of research studies for the past two decades, with surely many more to come.
García-Bermejo Giner and Ruano-García’s The Salamanca Corpus (SC) is a digital archive of diachronic dialect material covering the period 1500–1950 and providing valuable information about dialect variation and change in EModE and LModE. Released in 2011, the current version of the corpus presents more than 14 million words stemming from novels, short stories, poetry collections and plays. Divided into three chronological periods (1500–1699, 1700–1799 and 1800–1950), the 571 digitized works contain references to the different dialect varieties, of which Lancashire, Yorkshire, Devonshire and Cornwall are those with the highest number of references. As the authors explicitly state, the SC “represents our most important source of information about regional English variation in centuries from which we have very little contemporary information” (García-Bermejo Giner and Ruano-García 2024: this volume), thus becoming the ideal supplement to Joseph Wright’s English Dialect Dictionary (1898–1905).
The SC is also available for consultation on the project’s webpage (http://www.thesalamancacorpus.com), where users are offered the catalogues of the relevant texts for each variety and news about the texts incorporated into the archive, along with other information about research on the corpus. Of particular interest is the tool that enables the researcher to look for a particular word or string of words. As published on its webpage, the SC has been the source of information for a number of studies since the year 1991, a fact which corroborates the potential of the Salamanca project not only for the study of diachronic linguistics itself, but also for other areas such as dialectal variation and standardization in the Early and Late Modern periods of English (see, among others, Ruano-García 2020: 185–205).
The Mary Hamilton Papers is Denison, Yáñez-Bouza and Oudesluijs’s proposal for the study of English in the period c.1740–c.1850. This compilation incorporates ego-documents related to Mary Hamilton (1756–1816), the bulk of which is her private letter collection, together with “manuscript diaries”, “manuscript volumes” and additional documents related to Hamilton herself, all of which amount to c.875,500 words. This unique material provides the opportunity “to exploit an almost untouched archive and answer questions about literary practice, letter-writing and everyday language use in Georgian England”, apart from “shedding light on the socio-economic, cultural and political landscape or the region and period in which they were created” (Denison, Yáñez-Bouza and Oudesluijs 2024: this volume). The corpus is divided into seven different periods in accordance with the important milestones of Mary Hamilton’s life, thus also affording the opportunity to investigate any particular feature over time.
Following the rationale of other similar corpora, the corpus material is manually transcribed and made available, both in diplomatic and normalized versions, in the digital edition hosted in Manchester Digital Collections. One of the great advantages of the project is the possibility of accessing the high-resolution images of the material, surely of interest not only to historical linguists but also to those from neighbouring areas such as Manuscript Studies or Ecdotics. The normalized transcriptions make up the basis of the linguistic corpus, which is subsequently tagged for part of speech by the CLAWS tagger and for semantic analysis by the USAS tagger. While the transcribed version is accessible through the Manchester Digital Collections (www.digitalcollections.manchester.ac.uk/collections/maryhamilton/1), the tagged version is also indexed on CQPweb at Lancaster University. In a nutshell, The Mary Hamilton Papers will soon become, if it has not already, a compulsory source for the study of LModE language, history and culture in view of the uniqueness of the material, its handwritten condition and the potential of its part-of-speech and semantic annotations.
Details
- Pages
- 408
- Publication Year
- 2024
- ISBN (PDF)
- 9783034348263
- ISBN (ePUB)
- 9783034348270
- ISBN (Hardcover)
- 9783034346429
- DOI
- 10.3726/b21562
- Language
- English
- Publication date
- 2024 (May)
- Keywords
- Late Modern English; Corpus Linguistics; Language Standardisation; Scientific Writing; Medical English; Irish English; Late Modern English Dialects; The Mary Hamilton Papers
- Published
- Lausanne, Berlin, Bruxelles, Chennai, New York, Oxford, 2024. 408 pp., 56 fig. b/w, 56 tables.
- Product Safety
- Peter Lang Group AG