Edited By Stanislaw Gozdz-Roszkowski
Part Two – Corpus Tools, Information and Terminology Extraction
UGTag: Morphological Analyzer and Tagger for the Ukrainian Language

Natalia Kotsyba, Andriy Mykulyak and Igor V. Shevchenko

Abstract: UGTag, a program for morphological analysis and tagging of Ukrainian texts, is developed within the Polish-Ukrainian Parallel Corpus (PolUKR) project to support morphosyntactic annotation for the Ukrainian part of the corpus. The tagger accepts plain, HTML or XML texts and produces XML files structured according to the XCES standard and suitable for search with such programs as Poliqarp. The analysis consists of three stages: tokenization, tagging and chunking. At the tokenization stage the text is split into tokens (words, numbers, etc.). During tagging, all possible morphological and lemma interpretations are first assigned to each token (morphological analysis); the correct interpretation is then selected (disambiguation). During the chunking stage tokens are grouped into sentences. The Ukrainian Grammatical Dictionary is used as a source of morphological information for UGTag. UGTag is not restricted to this dictionary, however: its modular design allows additional dictionaries to be plugged in and the existing one to be modified. Users can interact with UGTag in three ways: through a console, a GUI, or a Web-based client.

Keywords: corpus, grammatical dictionary, morphological analyzer, PolUKR, Slavic, tagger, UGTag, UGD, Ukrainian, XCES.

1. Introduction

UGTag is a set of NLP tools for the Ukrainian language. Its development was inspired by the functionally similar TaKIPI toolset for Polish. There are two reasons for this. Firstly, TaKIPI is a very convenient software package with a well-thought-out design that includes all major NLP tasks to prepare...
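
The three stages named in the abstract (tokenization, tagging as analysis plus disambiguation, and chunking) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not UGTag's actual code or interface: the toy lexicon, the tag strings, the function names, and in particular the disambiguation step, which simply picks the first reading as a placeholder. The sketch also stops before XCES output, since the exact XML layout is not described in this excerpt.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy lexicon: lowercased word form -> list of (lemma, tag) readings.
# It only stands in for a real dictionary; the tag strings are invented.
TOY_LEXICON = {
    "мати": [("мати", "noun:nom:sg"), ("мати", "verb:inf")],
    "читає": [("читати", "verb:3sg:pres")],
    "книгу": [("книга", "noun:acc:sg")],
}

SENTENCE_END = {".", "!", "?"}


@dataclass
class Token:
    text: str
    readings: list = field(default_factory=list)  # all candidate (lemma, tag) pairs
    chosen: tuple = None                          # reading kept after disambiguation


def tokenize(text):
    """Stage 1: split text into word/number tokens and single punctuation marks.

    Simplification: an apostrophe inside a word would be split off.
    """
    return [Token(piece) for piece in re.findall(r"\w+|[^\w\s]", text)]


def analyze(tokens, lexicon):
    """Stage 2a: attach every possible reading to each token."""
    for tok in tokens:
        if tok.text.isdigit():
            tok.readings = [(tok.text, "num")]
        elif not tok.text[0].isalnum():
            tok.readings = [(tok.text, "punct")]
        else:
            tok.readings = lexicon.get(tok.text.lower(), [(tok.text, "unknown")])
    return tokens


def disambiguate(tokens):
    """Stage 2b: keep one reading per token.

    Placeholder only: takes the first reading. A real tagger would use context.
    """
    for tok in tokens:
        tok.chosen = tok.readings[0]
    return tokens


def chunk(tokens):
    """Stage 3: group tokens into sentences at sentence-final punctuation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.text in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sentences.append(current)
    return sentences


def run_pipeline(text, lexicon=TOY_LEXICON):
    """Run tokenization, tagging and chunking in order."""
    return chunk(disambiguate(analyze(tokenize(text), lexicon)))
```

Calling `run_pipeline("Мати читає книгу. Мати читає!")` returns two sentences, and the token `Мати` carries two candidate readings before one is chosen. Keeping the lexicon as a plain argument mirrors, in a very loose way, the pluggable-dictionary idea mentioned in the abstract.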