Principles of annotation of BulNC
Linguistic annotation increases the value of a corpus by making it more usable, as various kinds of information may be extracted; more multifunctional, as the corpus may serve different purposes; and more explicit with respect to the analysed information. In our approach, we adopt the following set of criteria for quality annotation:
- Multi-layered – the more richly annotated the corpus is, the broader its range of applications in research and applied studies. Corpus processing needs to cover and accumulate as many levels of linguistic annotation as possible.
- Compliance with standards – data formatting and the representation of annotation follow accepted standards. Unification of various tagsets and data formats, including encodings, is enabled through easy and reliable conversion.
- Uniformity – a common set of attributes and values for different languages and different media types – text, audio, image, video, and common techniques to manage (accumulate, combine, split, etc.) them. This will facilitate comparative studies and the application of language-independent tools.
- Consistency – as the annotation of large amounts of text is in most cases carried out automatically, it is necessary to provide means for validation and evaluation.
The following annotation principles are observed in general, both for manual and automatic annotation: the input text remains unchanged; the annotation is performed at consecutive stages and is accumulated as multi-level annotation; and the annotation data are represented as attribute-value pairs. Each annotation level is independent and may be accessed separately or merged with other compatible annotation schemes.
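The representation of annotation layers as attribute-value pairs can be sketched as follows; the token, attribute names, and values below are invented for illustration and do not reproduce the actual BulNC scheme:

```python
# Hedged sketch: multi-level annotation as attribute-value pairs.
# Each level is stored independently and merged into a combined view on demand.

def merge_levels(*levels):
    """Combine independent annotation levels for one token into a single view."""
    merged = {}
    for level in levels:
        merged.update(level)
    return merged

# Independent annotation levels for the token "котка" ('cat'); illustrative values.
morpho = {"pos": "N", "gender": "f", "number": "s"}   # morphosyntactic level
lemma  = {"lemma": "котка"}                           # lemmatisation level
sense  = {"wn_sense": "cat.n.01"}                     # wordnet sense level

token = merge_levels(morpho, lemma, sense)
```

Because each level is a separate mapping, a level can be validated, replaced, or dropped without touching the others, in line with the independence principle above.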
The Bulgarian texts are annotated using the Bulgarian language processing chain. It integrates a number of tools (a regular-expression-based sentence splitter and tokeniser, an SVM POS tagger, a dictionary-based lemmatiser, a finite-state chunker, and a wordnet sense annotation tool) designed to work together and to ensure interoperability, fast performance, and high accuracy. The training of the Bulgarian tagger is based on the following parameters: two passes in both directions; a window of five tokens, with the currently tagged word in second position; 2- and 3-grams of words, morphosyntactic tags, or ambiguity classes; and lexical parameters such as prefixes, suffixes, sentence borders, and capital letters. Lemmatisation is based on linking the tagger output to the Grammatical dictionary (75 word classes mapped to 1,029 unique grammatical tags in the dictionary), while a number of rules and preferences are applied to resolve ambiguities. The chunker is a rule-based parser working with a manually crafted grammar designed to recognise unambiguous phrases and their heads.
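The dictionary-lookup core of lemmatisation can be illustrated with a minimal sketch; the lexicon entries and tag strings below are invented for illustration and do not come from the actual Grammatical dictionary:

```python
# Hedged sketch of dictionary-based lemmatisation keyed on (wordform, tag),
# mirroring the idea of linking tagger output to a grammatical dictionary.
# The entries and tag strings are invented for illustration.

LEXICON = {
    ("котки", "Nc-fpi"): "котка",   # plural noun form -> singular lemma
    ("чете", "Vpitf-r3s"): "чета",  # inflected verb form -> citation-form lemma
}

def lemmatise(wordform, tag):
    """Return the lemma for a (wordform, tag) pair; fall back to the wordform."""
    return LEXICON.get((wordform, tag), wordform)
```

Keying the lookup on the tag as well as the wordform is what lets the POS tagger's output disambiguate homographic forms; the rules and preferences mentioned above would handle the cases where several lemmas remain possible.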
Apache OpenNLP with pre-trained models and Stanford CoreNLP are used for the annotation of the English texts – sentence segmentation, tokenisation, and POS tagging. OpenNLP can also be trained for and applied to other languages, and pre-trained models are available for a number of widely used languages (German and Spanish, among others). Lemmatisation of the English texts is performed using Stanford CoreNLP and RASP. As we aim at high quality and consistency of the annotation, we examine various systems for processing English and other languages.
Uniformity in annotation for Bulgarian and other languages is achieved in either of two ways:
- annotation of raw data from scratch, applying equal standards and principles, or
- conversion of already existing annotation.
In either case the tagset and conventions accepted for the BulNC are followed. The different tagsets are mapped to the Bulgarian tagset, but any language-specific annotation is preserved. The design of the Bulgarian tagset provides a uniform description of the inflexion of Bulgarian words and multiword expressions based on morphological and morphosyntactic criteria. The tagset is mappable to the Multext-East morphosyntactic descriptions, which are valuable as a unified framework for many European languages, although some disadvantages have been identified in the set of descriptions, both at a general and at a language-specific level.
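Decoding a positional tag into attribute-value pairs, which is what makes mapping between tagsets tractable, can be sketched as follows; the attribute inventory and the sample tag are illustrative, not the actual BulNC or Multext-East specifications:

```python
# Hedged sketch: decoding a positional morphosyntactic tag into attribute-value
# pairs, in the spirit of Multext-East-style descriptions. The attribute names
# and the sample tag are invented for illustration.

NOUN_POSITIONS = ["type", "gender", "number", "definiteness"]

def decode_noun_tag(tag):
    """Decode a positional noun tag such as 'Ncfsi' into attribute-value pairs."""
    if not tag.startswith("N"):
        raise ValueError("not a noun tag: " + tag)
    # Each character after the category letter fills one attribute slot.
    return dict(zip(NOUN_POSITIONS, tag[1:]))
```

Once both tagsets are decoded into a shared attribute-value representation, mapping between them reduces to comparing attributes, while attributes with no counterpart can be preserved as language-specific annotation.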
Parallel corpus alignment
Alignment at sentence level is essential for all parallel resources and it is therefore required for all language pairs. High-quality sentence segmentation is an important prerequisite for the quality of parallel text alignment. The vast majority of the errors that occur in sentence alignment follow from inaccurate sentence segmentation. Two aligners have been applied for parts of the corpus: HunAlign and Maligna.
The alignment is based on the Gale-Church algorithm, which uses a sentence-length distance measure and is largely language-independent. Other alignment methods, such as the Bilingual Sentence Aligner and the use of bilingual dictionaries, are envisaged as well. The aligners take as input texts with segmented sentences and produce a sequence of parallel sentence pairs (bi-sentences). At present, alignment is performed and tested on the Bulgarian-English Parallel Corpus.
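The core of length-based alignment can be sketched as a dynamic-programming search over sentence lengths. This is a simplified illustration, not the actual Gale-Church cost function, which is probabilistic and also allows 2-1 and 1-2 beads; the skip penalty here is an arbitrary illustrative value:

```python
# Simplified length-based sentence alignment in the spirit of Gale-Church:
# dynamic programming over sentence lengths, allowing 1-1, 1-0, and 0-1 beads.

def align(src, tgt, skip_penalty=10):
    """Return a list of (src_index, tgt_index) pairs for 1-1 aligned sentences."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 bead: cost is the character-length difference
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n:            # 1-0 bead: source sentence left unaligned
                c = cost[i][j] + skip_penalty
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < m:            # 0-1 bead: target sentence left unaligned
                c = cost[i][j] + skip_penalty
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    # Trace back the cheapest path, keeping only the 1-1 beads.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if i - pi == 1 and j - pj == 1:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```

The sketch makes the dependence on segmentation quality concrete: a single mis-split sentence changes the lengths fed to the cost function and can shift the cheapest path for all subsequent beads.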
A further step in parallel corpus processing is automatic alignment at the subsentential level: clause alignment (cf. the BulEnAC corpus), and phrase or word alignment.