Corpus Data across Languages and Disciplines

Edited by Piotr Pezik

In recent years, corpus tools and methodologies have gained widespread recognition in various areas of theoretical and applied linguistics. Data stored in corpora are explored and exploited across languages and disciplines as distinct as historical linguistics, language didactics, discourse analysis, machine translation and search engine development, to name but a few. This volume contains a selection of papers presented at the 8th edition of the Practical Applications in Language and Computers conference. It is aimed at helping a wide community of researchers, language professionals and practitioners keep up to date with new corpus theories and methodologies, as well as language-related applications of computational tools and resources.


CzeSL – an Error Tagged Corpus of Czech as a Second Language: Barbora Štindlová, Alexandr Rosen, Jirka Hana and Svatava Škodová

Extract

Abstract

Using an error-annotated learner corpus as the basis, the goal of this paper is two-fold: (i) to evaluate the practicality of the annotation scheme by computing inter-annotator agreement on a non-trivial sample of data, and (ii) to find out whether the application of automated linguistic annotation tools (taggers, spell checkers and grammar checkers) on the learner text is viable as a substitute for manual annotation.

Keywords

Learner corpus, error annotation, second language acquisition

Introduction

Texts produced by non-native speakers are a precious source of information about the acquisition of a language by the learners and about second language acquisition in general. Collections of such texts – learner corpora – can be annotated in a way similar to other corpora with morphosyntactic categories or syntactic structure. However, their most interesting aspect is examples of deviant use, which can be corrected and assigned a tag specifying the type of error. Annotation of this kind is a challenging task, even more so for a language such as Czech, with its rich inflection, derivation, agreement, and a largely information-structure-driven constituent order.

The present work is based on a project aimed at building a learner corpus with errors manually corrected and labelled within a three-level annotation scheme. Manual annotation is supplemented by morphosyntactic tags assigned to the hand-corrected input by a tagger, and by additional error tags, whenever they can be derived automatically. Options to...
