Linguistic annotation increases the value of a corpus by making it more usable, as various kinds of information may be extracted; more multifunctional, as the corpus may be used for different purposes; and more explicit with respect to the analysed information. In our approach, we adopt the following set of criteria for quality annotation:
The following annotation principles are observed in general, for both manual and automatic annotation: the input text remains unchanged; the annotation is performed at consecutive stages and accumulated as multi-level annotation; and the annotation data are represented as attribute-value pairs. Each annotation level is independent, may be accessed separately, and may be merged with other compatible annotation schemes.
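To make this representation concrete, the following minimal sketch shows multi-level stand-off annotation as attribute-value pairs over unchanged input text. All field names and values are illustrative only; they do not reproduce the actual BulNC format.

```python
# A minimal, illustrative sketch of multi-level stand-off annotation.
# Field names and values are hypothetical, not the BulNC format.
# The raw text is never modified: every level refers back to it by
# character offsets, so levels stay independent and can be merged freely.

text = "Corpora are annotated."

annotation = {
    "text": text,                       # input text, unchanged
    "levels": {
        "tokens": [                     # level 1: tokenisation
            {"id": "t1", "start": 0, "end": 7},
            {"id": "t2", "start": 8, "end": 11},
            {"id": "t3", "start": 12, "end": 21},
            {"id": "t4", "start": 21, "end": 22},
        ],
        "pos": [                        # level 2: attribute-value pairs per token
            {"ref": "t1", "attrs": {"pos": "noun", "number": "plural"}},
            {"ref": "t2", "attrs": {"pos": "verb", "tense": "present"}},
            {"ref": "t3", "attrs": {"pos": "participle"}},
            {"ref": "t4", "attrs": {"pos": "punct"}},
        ],
    },
}

# Each level can be accessed separately ...
pos_level = annotation["levels"]["pos"]

# ... or merged with another compatible level by joining on token ids.
merged = {t["id"]: a["attrs"] for t in annotation["levels"]["tokens"]
          for a in pos_level if a["ref"] == t["id"]}
print(merged["t1"])   # {'pos': 'noun', 'number': 'plural'}
```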
The Bulgarian texts are annotated using the Bulgarian language processing chain. It integrates a number of tools (a regular-expression-based sentence splitter and tokeniser, an SVM POS tagger, a dictionary-based lemmatiser, a finite-state chunker, and a wordnet sense annotation tool) designed to work together and to ensure interoperability, fast performance, and high accuracy. The training of the Bulgarian tagger is based on the following parameters: two passes in both directions; a window of five tokens, with the currently tagged word in second position; 2- and 3-grams of words, morphosyntactic tags, or ambiguity classes; and lexical parameters such as prefixes, suffixes, sentence borders, and capital letters. Lemmatisation is based on linking the tagger output to the Grammatical Dictionary (75 word classes corresponding to 1,029 unique grammatical tags), with a number of rules and preferences applied to resolve ambiguities. The chunker is a rule-based parser working with a manually crafted grammar designed to recognise unambiguous phrases and their heads.
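The sketch below illustrates the kind of feature extraction these tagger parameters imply, for a single tagging position. All function and feature names here are ours, and the actual tool's implementation differs; this is only a schematic rendering of the description above.

```python
# Illustrative feature extraction for an SVM POS tagger, following the
# parameters described above: a five-token window with the current word
# in second position, word/tag n-grams, and lexical cues such as
# prefixes, suffixes, sentence borders, and capitalisation.
# All names here are ours; this is a sketch, not the actual tool.

PAD = "<s>"  # sentence-border padding symbol

def features(tokens, tags, i):
    """Features for tokens[i]; tags[:i] hold already-assigned tags
    (first pass); the second pass, run in the opposite direction,
    can use right-context tags as well."""
    # Five-token window, current word in second position: one token of
    # left context, three tokens of right context.
    window = [tokens[j] if 0 <= j < len(tokens) else PAD
              for j in range(i - 1, i + 4)]
    word = tokens[i]
    feats = {f"w{k}": w for k, w in enumerate(window, start=-1)}

    # 2- and 3-grams of words and of previously assigned tags.
    feats["w_bigram"] = " ".join(window[0:2])
    feats["w_trigram"] = " ".join(window[0:3])
    feats["t-1"] = tags[i - 1] if i >= 1 else PAD
    feats["t-2,-1"] = " ".join(tags[max(i - 2, 0):i]) or PAD

    # Lexical parameters.
    feats["prefix3"] = word[:3]
    feats["suffix3"] = word[-3:]
    feats["is_capitalised"] = word[:1].isupper()
    feats["sentence_initial"] = (i == 0)
    return feats

print(features(["Корпусът", "е", "анотиран", "."], ["N"], 1))
```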
Apache OpenNLP with pre-trained models and Stanford CoreNLP are used for the annotation of the English texts: sentence segmentation, tokenisation, and POS tagging. OpenNLP can also be trained and applied to other languages, and pre-trained models exist for a number of widely used languages (German and Spanish, among others). Lemmatisation of the English texts is performed using Stanford CoreNLP and RASP. As we aim at high quality and consistency of the annotation, we examine various systems for processing English and other languages.
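For illustration, one way to drive Stanford CoreNLP for this subset of tasks is through its server interface, e.g. via the stanza Python client shown below. This is a sketch of a comparable setup, not necessarily the configuration used for the corpus, and it assumes a local CoreNLP installation with CORENLP_HOME set.

```python
# Sketch: English sentence segmentation, tokenisation, POS tagging, and
# lemmatisation via a local Stanford CoreNLP server, driven with the
# stanza client. This illustrates the pipeline described above; it is
# not necessarily the exact setup used for the corpus. Requires a
# CoreNLP installation with CORENLP_HOME pointing at it.
from stanza.server import CoreNLPClient

text = "The texts were aligned at sentence level."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos, token.lemma)
```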
In each case the tagset and conventions accepted for the BulNC are followed. The different tagsets are mapped to the Bulgarian tagset, while any language-specific annotation is preserved. The design of the Bulgarian tagset provides a uniform description of the inflexion of Bulgarian words and multiword expressions based on morphological and morphosyntactic criteria. The tagset is mappable to the Multext-East morphosyntactic descriptions, which are valuable as a unified framework for many European languages, although some disadvantages have been identified in the set of descriptions, at both a general and a language-specific level.
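A tagset mapping of this kind can be pictured as a translation between attribute-value pairs and positional MSD strings. The sketch below uses invented attribute codes purely for illustration; it does not reproduce the actual BulNC or Multext-East tables.

```python
# Illustrative mapping from attribute-value pairs to a Multext-East-style
# positional MSD string. Attribute order and codes here are invented for
# the example and do not reproduce the actual BulNC or MTE tables.

# Slot order for one part of speech; "-" marks a non-applicable slot.
NOUN_SLOTS = ["type", "gender", "number", "definiteness"]
CODES = {
    "type": {"common": "c", "proper": "p"},
    "gender": {"masculine": "m", "feminine": "f", "neuter": "n"},
    "number": {"singular": "s", "plural": "p"},
    "definiteness": {"indefinite": "i", "definite": "d"},
}

def to_msd(pos_letter, attrs, slots):
    """Encode attribute-value pairs as a positional tag, e.g. 'Ncmsi'."""
    return pos_letter + "".join(
        CODES[slot].get(attrs.get(slot), "-") for slot in slots)

# A hypothetical analysis of a Bulgarian common noun:
analysis = {"type": "common", "gender": "masculine",
            "number": "singular", "definiteness": "indefinite"}
print(to_msd("N", analysis, NOUN_SLOTS))   # -> Ncmsi
```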
Alignment at sentence level is essential for all parallel resources and is therefore required for all language pairs. High-quality sentence segmentation is an important prerequisite for the quality of parallel text alignment: the vast majority of the errors that occur in sentence alignment follow from inaccurate sentence segmentation. Two aligners have been applied to parts of the corpus: HunAlign and Maligna.
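For reference, HunAlign is typically invoked on pre-segmented, one-sentence-per-line files; the sketch below wraps such a call. File paths are placeholders, flag usage follows the HunAlign documentation (treat the details as an assumption if your version differs), and an empty dictionary file yields a purely length-based alignment.

```python
# Sketch: invoking HunAlign on two pre-segmented files (one sentence per
# line). Paths are placeholders; flag usage follows the HunAlign
# documentation (-text prints aligned sentence pairs instead of the
# numeric "ladder"). An empty dictionary file gives a purely
# length-based alignment, which a bilingual dictionary can improve.
import subprocess

result = subprocess.run(
    ["hunalign", "-text", "null.dic", "bg.sent", "en.sent"],
    capture_output=True, text=True, check=True)

for line in result.stdout.splitlines():
    src, tgt, score = line.split("\t")   # bi-sentence plus confidence
    print(score, src, "|||", tgt)
```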
The alignment is based on the Gale-Church algorithm, which uses a sentence-length distance measure and is largely language-independent. Other alignment methods, such as the Bilingual Sentence Aligner and the use of bilingual dictionaries, are envisaged as well. The aligners take as input texts with segmented sentences and produce a sequence of parallel sentence pairs (bi-sentences). At present, alignment is performed and tested on the Bulgarian-English Parallel Corpus.
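To make the length-based idea concrete, the following is a compact, simplified sketch of Gale-Church-style dynamic programming: our own minimal reimplementation of the published algorithm, not the aligner actually used. The cost of pairing sentence groups depends only on their lengths, under the assumption that length ratios between translations are roughly normally distributed.

```python
# A compact, simplified Gale-Church-style sentence aligner: our own
# sketch of the published algorithm, not the code used for the corpus.
# Cost depends only on sentence lengths (in characters), so the method
# is largely language-independent.
import math

# Empirical constants from Gale & Church (1993): mean length ratio C,
# variance S2, and prior probabilities of the alignment patterns.
C, S2 = 1.0, 6.8
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.0445, (1, 2): 0.0445, (2, 2): 0.011}

def match_cost(l1, l2):
    """-log probability that groups of lengths l1, l2 are translations.
    (l1 + 1) in the denominator is a simplification to avoid division
    by zero for 0-1 and 1-0 patterns."""
    if l1 == 0 and l2 == 0:
        return 0.0
    delta = (l2 - l1 * C) / math.sqrt((l1 + 1) * S2)
    # Two-tailed probability under a standard normal distribution.
    p = max(2 * (1 - 0.5 * (1 + math.erf(abs(delta) / math.sqrt(2)))), 1e-12)
    return -math.log(p)

def align(src, tgt):
    """Minimum-cost alignment of two sentence lists into bi-sentences."""
    m, n = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            for (di, dj), prior in PRIORS.items():
                if i >= di and j >= dj and cost[i - di][j - dj] < INF:
                    l1 = sum(len(s) for s in src[i - di:i])
                    l2 = sum(len(s) for s in tgt[j - dj:j])
                    c = cost[i - di][j - dj] - math.log(prior) + match_cost(l1, l2)
                    if c < cost[i][j]:
                        cost[i][j], back[i][j] = c, (di, dj)
    # Trace back the best path into bi-sentence groups.
    pairs, i, j = [], m, n
    while i or j:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    return list(reversed(pairs))

src = ["Това е кратко изречение.", "Следва още едно, малко по-дълго изречение."]
tgt = ["This is a short sentence.", "Here is one more, slightly longer sentence."]
for s, t in align(src, tgt):
    print(" ".join(s), "<->", " ".join(t))
```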
A further step in parallel corpora processing is automatic alignment at the subsentential level: clause alignment (cf. the BulEnAC corpus), and phrase or word alignment.