
Explorations across Languages and Corpora

PALC 2009


Stanislaw Gozdz-Roszkowski

This volume attempts to keep track of the most recent developments in corpus-based, corpus-driven and corpus-informed studies. It signals the widening scope of, and perspectives on, language and computers by documenting new developments and explorations in these areas. The topics and themes range from national corpora, corpus tools, and information and terminology extraction through cognitive processes, discourse and ideology, academic discourse, translation, and lexicography to language teaching and learning. The contributions are drawn from a selection of papers presented at the 7th Practical Applications in Language and Computers (PALC) conference, held at the University of Łódź in 2009.



Part Two – Corpus Tools, Information and Terminology Extraction

Extract

UGTag: Morphological Analyzer and Tagger for the Ukrainian Language

Natalia Kotsyba, Andriy Mykulyak and Igor V. Shevchenko

Abstract: UGTag, a program for morphological analysis and tagging of Ukrainian texts, is developed within the Polish-Ukrainian Parallel Corpus (PolUKR) project to support morphosyntactic annotation for the Ukrainian part of the corpus. The tagger accepts plain, HTML or XML texts and produces XML files structured according to the XCES standard and suitable for search with programs such as Poliqarp. The analysis consists of three stages: tokenization, tagging and chunking. At the tokenization stage, the text is split into tokens (words, numbers, etc.). During tagging, all possible morphological and lemma interpretations are first assigned to each token (morphological analysis), and the correct interpretation is then selected (disambiguation). During the chunking stage, tokens are grouped into sentences. The Ukrainian Grammatical Dictionary is used as the source of morphological information for UGTag. UGTag is not restricted to this dictionary, however: its modular design allows additional dictionaries to be plugged in and the existing one to be modified. Users can interact with UGTag in three ways: through a console, a GUI, or a Web-based client.

Keywords: corpus, grammatical dictionary, morphological analyzer, PolUKR, Slavic, tagger, UGTag, UGD, Ukrainian, XCES.

1. Introduction

UGTag is a set of NLP tools for the Ukrainian language. Its development was inspired by the functionally similar TaKIPI toolset for Polish. There are two reasons for this. Firstly, TaKIPI is a very convenient software package with a well-thought-out design that includes all major NLP tasks to prepare...
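The three stages named in the abstract (tokenization, tagging as analysis plus disambiguation, and chunking into sentences) can be illustrated with a minimal Python sketch. This is not UGTag's code: the toy lexicon, the tag labels, the function names and the "pick the first reading" disambiguation rule are all assumptions made for illustration, and the sketch does not attempt XCES output or real dictionary lookup.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical toy lexicon: surface form -> list of (lemma, tag) readings.
# The tag labels are illustrative and are not the UGTag/UGD tagset.
TOY_LEXICON = {
    "мама": [("мама", "noun:sg:nom")],
    "мила": [("мити", "verb:past:sg:f"), ("милий", "adj:sg:f:nom")],
    "раму": [("рама", "noun:sg:acc")],
}


@dataclass
class Token:
    text: str
    readings: list = field(default_factory=list)  # all candidate (lemma, tag) pairs
    chosen: Optional[tuple] = None                # reading kept after disambiguation


def tokenize(text):
    """Stage 1: split text into word/number tokens and single punctuation marks."""
    return [Token(t) for t in re.findall(r"\w+|[^\w\s]", text)]


def analyze(token):
    """Stage 2a: attach every reading found in the lexicon (or a fallback)."""
    readings = TOY_LEXICON.get(token.text.lower())
    if readings is None:
        if token.text.isdigit():
            readings = [(token.text, "num")]
        elif re.fullmatch(r"[^\w\s]", token.text):
            readings = [(token.text, "punct")]
        else:
            readings = [(token.text, "unknown")]
    token.readings = list(readings)


def disambiguate(token):
    """Stage 2b: placeholder rule that keeps the first reading.
    A real tagger would use context here; this is only a stand-in."""
    token.chosen = token.readings[0]


def chunk(tokens):
    """Stage 3: group tokens into sentences at '.', '!' or '?'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.text in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final terminator
        sentences.append(current)
    return sentences


def run_pipeline(text):
    tokens = tokenize(text)
    for tok in tokens:
        analyze(tok)
        disambiguate(tok)
    return chunk(tokens)


if __name__ == "__main__":
    for sentence in run_pipeline("Мама мила раму. Це 2 тест!"):
        print([(t.text, t.chosen) for t in sentence])
```

Keeping analysis and disambiguation as separate functions mirrors the two-step tagging the abstract describes, and passing the lexicon in as plain data hints at how additional dictionaries could be plugged in under a modular design.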
