The Bulgarian Language Processing Chain (developed in 2011-2012) includes the following types of text processing and linguistic annotation:
- Sentence segmentation;
- Tokenisation;
- POS tagging and grammatical annotation;
- Lemmatisation.
The Bulgarian POS tagger (BgTagger) marks up each word with the most probable part of speech and unambiguous morphosyntactic information, selected from the set of tags associated with that word. The tagger is based on SVM (Support Vector Machine) learning and predicts the POS tag of a word from a set of features describing the word and its context. These features are:
- words, word bigrams and trigrams within a window of words around the currently tagged word;
- POS tags, POS tag bigrams and trigrams in the current window;
- information about suffixes, prefixes, capitalisation, hyphenation, etc. for unknown words.
The tagger is trained and tested on a manually POS-disambiguated corpus (BulPosCor). The strategy chosen for training the Bulgarian tagger is: (i) two passes in both directions; (ii) a window of five tokens, with the currently tagged word in the second position; (iii) bigrams and trigrams of words, tags or ambiguity classes, plus lexical parameters such as prefixes, suffixes, sentence borders and capital letters.
The trained model is applied to disambiguate texts. To date, the tagger achieves a precision of 96.58%.
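The window layout described above can be illustrated with a short feature-extraction sketch. This is not BgTagger's code: the function name, the feature keys, the padding symbol and the three-character affix length are all assumptions made for illustration. Only word n-grams and lexical features are shown; tag and ambiguity-class features would be collected in the same way.

```python
from typing import Dict, List

PAD = "<s>"  # hypothetical padding symbol standing in for sentence borders


def window_features(tokens: List[str], i: int,
                    size: int = 5, pos_in_window: int = 1) -> Dict[str, str]:
    """Collect word n-gram and lexical features for tokens[i].

    The window holds `size` tokens and the current word sits at index
    `pos_in_window` (1 means "second position", as in the description).
    """
    start = i - pos_in_window
    window = [
        tokens[j] if 0 <= j < len(tokens) else PAD
        for j in range(start, start + size)
    ]

    feats: Dict[str, str] = {}
    # Unigrams, keyed by offset relative to the current word.
    for k, w in enumerate(window):
        feats[f"w[{k - pos_in_window}]"] = w.lower()
    # Word bigrams and trigrams inside the window.
    for n in (2, 3):
        for k in range(size - n + 1):
            first = k - pos_in_window
            feats[f"w[{first}:{first + n - 1}]"] = " ".join(
                w.lower() for w in window[k:k + n]
            )

    # Lexical features of the current word (assumed affix length: 3).
    word = tokens[i]
    feats["prefix3"] = word[:3].lower()
    feats["suffix3"] = word[-3:].lower()
    feats["is_capitalised"] = str(word[:1].isupper())
    feats["has_hyphen"] = str("-" in word)
    return feats


features = window_features(["Това", "е", "тест", "."], 0)
```

For the first token of a sentence, the left neighbour falls outside the sentence, so the padding symbol fills that slot and acts as a sentence-border feature.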
The Bulgarian lemmatiser determines the lemma and assigns detailed morphosyntactic annotation to it. Lemmatisation is based on an unambiguous association between the tagger output and the information encoded in a large grammatical dictionary of Bulgarian. In the tagging process, the lemmatiser uses a reduced tagset (75 word classes, compared with 1029 unique grammatical tags in the dictionary), compiled so as to retain the minimum information necessary for unambiguous association with the respective lemma. A small number of rules and preferences are also implemented to limit ambiguity in lemmatisation.
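The association between tagger output and dictionary entries can be sketched as a lookup keyed on the word form and its reduced tag. The toy dictionary, the tag strings and the "first listed entry wins" preference below are assumptions for illustration only; the actual dictionary format, tagset and rules are not described here.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    lemma: str
    full_tag: str  # detailed dictionary tag (made-up notation)


# Toy dictionary: word form -> list of (reduced tag, dictionary entry).
# "коси" is ambiguous between a noun plural and a verb form.
DICTIONARY: Dict[str, List[Tuple[str, Entry]]] = {
    "коси": [
        ("N", Entry("коса", "N-f-pl-indef")),
        ("V", Entry("кося", "V-pres-3-sg")),
    ],
    "маса": [("N", Entry("маса", "N-f-sg-indef"))],
}


def lemmatise(word: str, reduced_tag: str) -> Optional[Entry]:
    """Return the dictionary entry selected by the tagger's reduced tag."""
    candidates = [
        entry
        for tag, entry in DICTIONARY.get(word.lower(), [])
        if tag == reduced_tag
    ]
    if not candidates:
        return None  # unknown word, or tag not licensed by the dictionary
    # Assumed preference for any remaining ambiguity: first listed entry.
    return candidates[0]


noun_reading = lemmatise("коси", "N")
verb_reading = lemmatise("коси", "V")
```

Here the reduced tag alone is enough to choose between the two readings of the same word form, which is the role the reduced tagset plays in the description above.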
- tools for advanced processing and annotation;
- tools for annotation and alignment of parallel texts at sentential and subsentential level.
A highly scalable web-service-based infrastructure was developed to provide easy access to the tools for text processing and annotation of Bulgarian. Three types of access are provided:
- online access – suitable for users who occasionally need to process relatively small amounts of data;
- access via RESTful API – suitable for software developers who want to integrate the processing tools into high-level applications;
- asynchronous access – suitable for time-consuming tasks such as processing large corpora: the user uploads the archived corpus, it is processed on the server, a notification email is sent upon completion of the task, and the annotated corpus can then be downloaded.
Fig. 1. Web infrastructure’s interface for asynchronous tasks
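As a rough illustration of the RESTful access mode, the sketch below builds (but does not send) an HTTP request using only the Python standard library. The endpoint URL, the JSON field names and the step names are entirely hypothetical, since the actual API is documented in the user manual rather than here.

```python
import json
import urllib.request
from typing import List

# Hypothetical endpoint: the real service URL is not given in this document.
BASE_URL = "https://example.org/bg-chain/api/annotate"


def build_request(text: str, steps: List[str]) -> urllib.request.Request:
    """Build a POST request asking for the given processing steps.

    The payload layout ({"text": ..., "steps": [...]}) is an assumption.
    """
    payload = json.dumps(
        {"text": text, "steps": steps}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        BASE_URL,
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


req = build_request("Това е тест.", ["tokenise", "tag", "lemmatise"])
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`; the response format would have to be taken from the service documentation.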
The major advantages of the infrastructure are:
- affords high-quality linguistic processing of Bulgarian language resources;
- supplies complex and compatible multi-level annotations;
- based on state-of-the-art technologies;
- provides different levels of access that cater for the particular needs of different types of users;
- highly scalable, can be distributed on different machines.
- Programming languages: C++, PHP;
- Performance: Linux
The Web-Based Infrastructure for Bulgarian Data Processing is available on request. Please contact: firstname.lastname@example.org.
Koeva, Sv., Genov, A. Bulgarian Language Processing Chain. In Proceedings of the Workshop on the Integration of Multilingual Resources and Tools in Web Applications, 26 September 2011, Hamburg.
For more information on how to use the Web-based Infrastructure, consult the WebInfrastructure User Manual.
Contact person: Martin Yalamov