Content

Bulgarian texts

The Bulgarian National Corpus (BulNC) was established in 2009 in response to the growing need for high-quality language resources for Bulgarian computational linguistics, various computer applications for natural language processing, theoretical linguistics, etc. Initially, the BulNC was created as a monolingual corpus of Bulgarian texts. Its enrichment with parallel corpora in recent years reflects the expansion of research interests in computational linguistics towards multilingual applications – machine translation, information extraction from multilingual sources, etc.

 

The core of the BulNC consists of Bulgarian texts – over 240 000 text samples amounting to 1.2 billion tokens.

 

Original Bulgarian texts make up 37.1% of the corpus and translated texts 40.5%; for the remaining 22.4%, the source and the direction of translation are not known.

 

The texts also differ in modality: they are predominantly written (97.35%), with spoken texts (2.65%) limited to a few types – lectures, parliamentary proceedings and subtitles.

 

The majority of the texts (97.5%) are obtained from the internet either through automatic crawling or manual downloading, while the remaining 2.5% are provided by the authors or publishers.

 

The distribution of texts with respect to styles is presented in the figure below.

The parallel corpora within BulNC are collectively named Bul-X-Cor and comprise 47 corpora for different languages which have been compiled with Bulgarian as a pivot language. The parallel corpora vary in terms of size and diversity depending on the availability of parallel texts for the particular language pair. The parallel corpora cover English, German, French, most Slavic and Balkan languages, as well as many other European and non-European (both taxonomically and geographically) languages.

Organisation of the parallel corpora within BulNC

Each parallel corpus consists exclusively of texts that have a correspondence in Bulgarian – either the original or a translation. Both texts may be translations from a third language. The parallel corpora are part of the Bulgarian National Corpus (BulNC). Their structure, data format and description follow closely the model of the BulNC. The texts are supplied with detailed metadata, extracted automatically wherever possible, and manually elaborated, if necessary.

 

The main principle of organisation of the corpus is shown in the diagram below. Each text is stored only once. Each parallel equivalent is directly related to its Bulgarian counterpart and indirectly – to its counterparts in other languages, if they exist.

 

The main principle of organisation of the corpus.

 

The structure of each parallel corpus reflects the structure of the core of BulNC – the same classification based on style, domain and genre, is adopted.
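The organising principle described above (each text stored once, every foreign text linked directly to its Bulgarian counterpart and only indirectly to other languages) can be illustrated with a small sketch. This is not the BulNC implementation: the class `PivotIndex`, its method names, the text identifiers and the simplification of one counterpart per language are all assumptions made for illustration.

```python
from collections import defaultdict


class PivotIndex:
    """Toy index (hypothetical): texts are stored once and linked via a Bulgarian pivot."""

    def __init__(self):
        self.texts = {}                 # text_id -> (language code, content)
        self.links = defaultdict(dict)  # Bulgarian text_id -> {language code: text_id}
        self.pivot_of = {}              # foreign text_id -> Bulgarian text_id

    def add_text(self, text_id, lang, content):
        # Each text is stored only once.
        if text_id in self.texts:
            raise ValueError(f"{text_id} is already stored")
        self.texts[text_id] = (lang, content)

    def link(self, bg_id, other_id):
        # Direct relation: a foreign text is attached to its Bulgarian counterpart.
        if self.texts[bg_id][0] != "bg":
            raise ValueError("the pivot text must be Bulgarian")
        lang = self.texts[other_id][0]
        self.links[bg_id][lang] = other_id
        self.pivot_of[other_id] = bg_id

    def counterparts(self, text_id):
        # Indirect relation: counterparts in other languages are reached through the pivot.
        lang, _ = self.texts[text_id]
        bg_id = text_id if lang == "bg" else self.pivot_of.get(text_id)
        if bg_id is None:
            return {}
        result = dict(self.links.get(bg_id, {}))
        if lang != "bg":
            result.pop(lang, None)  # drop the text's own language
            result["bg"] = bg_id    # add the Bulgarian pivot itself
        return result


index = PivotIndex()
index.add_text("bg-001", "bg", "Bulgarian text")
index.add_text("en-001", "en", "English text")
index.add_text("de-001", "de", "German text")
index.link("bg-001", "en-001")
index.link("bg-001", "de-001")

# The English text reaches the German one only through the Bulgarian pivot.
print(index.counterparts("en-001"))
```

With this layout, adding a new language requires only links to the Bulgarian texts; no pairwise links between foreign languages are stored.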

Size of parallel corpora

The parallel corpora are being constantly enlarged so that a greater variety of styles, thematic domains and genres may be attained. Currently (end-January 2013), the overall size of the parallel corpora amounts to 4.2 billion words.

 

The largest parallel corpus is the Bulgarian-English one, with approximately 260 million tokens per language. There are six further corpora of 200-250 million tokens per language and 14 corpora of 150-200 million. The remaining corpora are relatively small: 11 corpora of 1-15 million tokens and 15 corpora below 1 million. The smallest parallel corpus is the Bulgarian-Japanese one, with around 50,000 words per language.

 

Largest parallel corpora within BulNC.

 

Parallel corpus         Lang. code  Number of texts  Number of tokens
Bulgarian-English       BG-EN                113545         260681821
Bulgarian-Romanian      BG-RO                114440         235859637
Bulgarian-French        BG-FR                 71935         231486663
Bulgarian-Greek         BG-EL                113849         229749068
Bulgarian-Portuguese    BG-PT                 70697         211824204
Bulgarian-Italian       BG-IT                 71195         209083677
Bulgarian-Dutch         BG-NL                 70629         204309755
Bulgarian-Polish        BG-PL                 78055         197762449
Bulgarian-Czech         BG-CS                 72545         196769297
Bulgarian-German        BG-DE                 77502         194497872
Bulgarian-Spanish       BG-ES                 62879         191092782
Bulgarian-Danish        BG-DA                 71316         190843358
Bulgarian-Slovak        BG-SK                 71790         189752630
Bulgarian-Slovene       BG-SL                 71343         188776967
Bulgarian-Hungarian     BG-HU                 71618         183530929
Bulgarian-Swedish       BG-SV                 70115         180752058
Bulgarian-Lithuanian    BG-LT                 70858         170381570
Bulgarian-Latvian       BG-LV                 70015         167600804
Bulgarian-Maltese       BG-MT                 65218         163515445
Bulgarian-Estonian      BG-ET                 71558         160175247
Bulgarian-Finnish       BG-FI                 71247         156288741
Bulgarian-Turkish       BG-TR                 36655          13297328
Bulgarian-Irish         BG-GA                  2230          13287693
Bulgarian-Croatian      BG-HR                 33948          11950183
Bulgarian-Albanian      BG-SQ                 35787           9781443
Bulgarian-Macedonian    BG-MK                 35761           9542940
Bulgarian-Bosnian       BG-BS                 20736           6195646
Bulgarian-Russian       BG-RU                   211           3293243
Bulgarian-Hebrew        BG-HE                   446           2872765
Bulgarian-Arabic        BG-AR                   370           2446857
Bulgarian-Serbian       BG-SR                   865           1832323
Bulgarian-Norwegian     BG-NO                   173           1588561
Bulgarian-Icelandic     BG-IS                    41            762894
Bulgarian-Ukrainian     BG-UK                    40            744815
Bulgarian-Catalan       BG-CA                    26            640522
Bulgarian-Galician      BG-GL                    25            629272
Bulgarian-Kazakh        BG-KK                    29            486766
Bulgarian-Basque        BG-EU                    25            461080
Bulgarian-Chinese       BG-ZH                    34            229293
Bulgarian-Tajik         BG-TG                    16            160123
Bulgarian-Armenian      BG-HY                    16            139802
Bulgarian-Azerbaijani   BG-AZ                    16            137238
Bulgarian-Mongolian     BG-MN                    16            135076
Bulgarian-Kyrgyz        BG-KY                    16            135031
Bulgarian-Georgian      BG-KA                    16            128502
Bulgarian-Turkmen       BG-TK                    15            127430
Bulgarian-Japanese      BG-JA                    10             50194
Total                                       1789872        4195791994

 

Size of the parallel corpora in number of texts and number of tokens.
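The size bands and the overall total can be recomputed from the table with a short script. The sketch below copies the language codes and token counts from the table; the band boundaries (lower bound inclusive, upper bound exclusive) and all names such as `BANDS` and `band_counts` are assumptions chosen for illustration.

```python
# Language code and token count per parallel corpus, copied from the table above.
ROWS = """
EN 260681821  RO 235859637  FR 231486663  EL 229749068
PT 211824204  IT 209083677  NL 204309755  PL 197762449
CS 196769297  DE 194497872  ES 191092782  DA 190843358
SK 189752630  SL 188776967  HU 183530929  SV 180752058
LT 170381570  LV 167600804  MT 163515445  ET 160175247
FI 156288741  TR 13297328   GA 13287693   HR 11950183
SQ 9781443    MK 9542940    BS 6195646    RU 3293243
HE 2872765    AR 2446857    SR 1832323    NO 1588561
IS 762894     UK 744815     CA 640522     GL 629272
KK 486766     EU 461080     ZH 229293     TG 160123
HY 139802     AZ 137238     MN 135076     KY 135031
KA 128502     TK 127430     JA 50194
""".split()

# Pair every code with the number that follows it.
tokens = {code: int(count) for code, count in zip(ROWS[::2], ROWS[1::2])}

# Size bands in tokens: (label, inclusive lower bound, exclusive upper bound).
BANDS = [
    ("below 1 million", 0, 1_000_000),
    ("1-15 million", 1_000_000, 15_000_000),
    ("15-150 million", 15_000_000, 150_000_000),
    ("150-200 million", 150_000_000, 200_000_000),
    ("200-250 million", 200_000_000, 250_000_000),
    ("250 million and above", 250_000_000, float("inf")),
]


def band_counts(token_counts):
    """Count how many corpora fall into each size band."""
    counts = {label: 0 for label, _, _ in BANDS}
    for n in token_counts.values():
        for label, low, high in BANDS:
            if low <= n < high:
                counts[label] += 1
                break
    return counts


counts = band_counts(tokens)
total = sum(tokens.values())

for label, number in counts.items():
    print(f"{label:>22}: {number} corpora")
print(f"{len(tokens)} corpora, {total:,} tokens in total")
```

Keeping the counts in a single mapping makes it easy to rerun the tally whenever the corpora are enlarged.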

Bulgarian-English parallel corpus

The largest parallel corpus within BulNC is the Bulgarian-English parallel corpus, which comprises 260.7 million tokens for English and 263.1 million tokens for Bulgarian. The distribution of texts with respect to styles in the Bulgarian-English parallel corpus is shown in the diagram below.

Distribution of texts with respect to styles in the Bulgarian-English parallel corpus.

The Bulgarian-English parallel corpus has been used for various research tasks. The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English parallel corpus. BulEnAC has been used in NLP applications for text alignment and machine translation.