On the automatic analysis of learner corpora. Native Language Identification as experimental testbed of language modeling between surface features and linguistic abstraction: Detmar Meurers / Julia Krivanek / Serhiy Bykh
Learner corpora, collections of language produced by language learners, have been systematically collected since the 1990s. With readily available collections such as the International Corpus of Learner English (ICLE) (Granger et al. 2009) for English and Falko (Lernerkorpus des Deutschen als Fremdsprache) (Lüdeling et al. 2008) for German, there is a growing empirical basis that can inform theories of second language acquisition and of the linguistic system, and on which applications can be tested.
While most research on learner corpora has analyzed the (co)occurrence of words and word sequences or manual error annotation, tools for automatically analyzing large corpora in terms of linguistic abstractions such as parts of speech, syntactic constituency, or dependency are increasingly available. Similar to the discussion about the role of exemplars vs. prototypes in language, this situation raises the question of when to consider surface forms as such and when linguistic categories that abstract and generalize over surface forms are useful in a corpus-based analysis of learner language.