Bulgarian-X Language Parallel Corpus

The Bulgarian-X Language Parallel Corpus (Bul-X-Cor) is part of the Bulgarian National Corpus (BulNC). BulNC is designed as a uniform framework for texts that differ in modality (written – spoken), period (synchronic – diachronic), and number of languages (monolingual – parallel, where one of the counterparts is Bulgarian). Every X language in the corpus is treated equally with respect to text-type diversity and balance, metadata description scheme, preprocessing and annotation, search engine queries, and data storage format.

The Bulgarian-X Language Parallel Corpus includes parallel corpora for 48 languages: English, German, French, Slavic and Balkan languages, as well as other European and non-European languages.

The parallel corpora contain only texts that have a Bulgarian counterpart: the original is in Bulgarian, the text has a Bulgarian translation, or both texts are translations from a third language.

As of January 2013, the Bulgarian-X Language Parallel Corpus contains 4.2 billion tokens, making it the largest parallel corpus for Bulgarian. Languages are not equally represented: the largest parallel corpus is the Bulgarian-English one (280.8 and 283.1 million words for Bulgarian and English respectively); there are 18 other corpora of over 200 million tokens per language, 2 parallel corpora of between 100 and 200 million tokens per language, 11 parallel corpora in the range of 5-15 million tokens per language, and the remaining 15 are below 1 million tokens, the smallest being Japanese with 50,000 tokens. Each parallel subcorpus within Bul-X-Cor mirrors the structure of BulNC.
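
The size breakdown above can be tallied with a short sketch. The counts are the ones quoted in the paragraph; the band labels and the names `size_bands` and `total_corpora` are illustrative assumptions and are not part of the corpus or its tools.

```python
# Number of parallel corpora in each size band, as quoted in the text.
# The labels are informal descriptions chosen for this sketch.
size_bands = {
    "Bulgarian-English (largest)": 1,
    "over 200 million tokens per language": 18,
    "100-200 million tokens per language": 2,
    "5-15 million tokens per language": 11,
    "below 1 million tokens per language": 15,
}


def total_corpora(bands):
    """Sum the number of parallel corpora across all size bands."""
    return sum(bands.values())


total = total_corpora(size_bands)
print(f"Parallel corpora covered by the listed bands: {total}")
```

With the figures as quoted, the bands account for 47 parallel corpora. Counting Bulgarian itself alongside them would give the 48 languages mentioned earlier, though that reading is an assumption rather than something the text states.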

The structure, data formatting and text description follow the model of BulNC. All Bulgarian texts in BulNC and English texts in Bul-X-Cor are supplied with extensive metadata descriptions compliant with well-established standards. The Bulgarian-English parallel corpus is also annotated at various levels, while annotation of the other languages has only just started.

The main applications of parallel corpora are in computational linguistics: machine translation, the development of bilingual lexical resources (dictionaries), and so on. Parallel corpora become more useful when they are annotated.

Copyright © 2015 Department of computational linguistics. All rights reserved.