The Bulgarian National Corpus (BulNC) was established in 2009 in response to the growing need for high-quality language resources for Bulgarian computational linguistics, natural language processing applications, theoretical linguistics, etc. Initially, the BulNC was created as a monolingual corpus of Bulgarian texts. Its enrichment with parallel corpora in recent years reflects the expansion of research interests in computational linguistics towards multilingual applications such as machine translation and information extraction from multilingual sources.
The core of the BulNC consists of Bulgarian texts – over 240 000 text samples amounting to 1.2 billion tokens.
The original Bulgarian texts comprise 37.1% of the corpus and the translated texts 40.5%; for the remaining 22.4%, the source and the direction of translation are not known.
The texts also differ in modality: they are predominantly written (97.35%), while the spoken texts (2.65%) are limited to a few types, namely lectures, parliamentary proceedings and subtitles.
The majority of the texts (97.5%) are obtained from the internet either through automatic crawling or manual downloading, while the remaining 2.5% are provided by the authors or publishers.
The distribution of texts with respect to styles is presented in the figure below.
Each parallel corpus consists exclusively of texts that have a correspondence in Bulgarian – either the original or a translation. Both texts may be translations from a third language. The parallel corpora are part of the Bulgarian National Corpus (BulNC). Their structure, data format and description follow closely the model of the BulNC. The texts are supplied with detailed metadata, extracted automatically wherever possible, and manually elaborated, if necessary.
The main principle of organisation of the corpus is shown in the diagram below. Each text is stored only once. Each parallel equivalent is directly related to its Bulgarian counterpart and indirectly related to its counterparts in other languages, if they exist.
The main principle of organisation of the corpus.
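The organisation principle can be illustrated with a minimal Python sketch. It is not the BulNC's actual storage model: the names `Text`, `ParallelStore`, `link` and `counterparts` are hypothetical and chosen only for illustration. It assumes each text has a unique identifier and a language code, and it stores only direct links to the Bulgarian text, deriving links between other languages through that Bulgarian text.

```python
from dataclasses import dataclass


@dataclass
class Text:
    text_id: str   # hypothetical unique identifier
    lang: str      # language code, e.g. "bg", "en", "ro"
    content: str


class ParallelStore:
    """Illustrative store: each text is kept once, links go through Bulgarian."""

    def __init__(self):
        self.texts = {}        # text_id -> Text (each text stored only once)
        self.bg_of = {}        # non-Bulgarian text_id -> Bulgarian text_id
        self.equivalents = {}  # Bulgarian text_id -> {lang: text_id}

    def add_text(self, text):
        if text.text_id in self.texts:
            raise ValueError(f"text {text.text_id} is already stored")
        self.texts[text.text_id] = text
        if text.lang == "bg":
            self.equivalents[text.text_id] = {}

    def link(self, bg_id, other_id):
        """Record a direct link between a Bulgarian text and its equivalent."""
        bg, other = self.texts[bg_id], self.texts[other_id]
        if bg.lang != "bg" or other.lang == "bg":
            raise ValueError("links must pair a Bulgarian text with a non-Bulgarian one")
        self.bg_of[other_id] = bg_id
        self.equivalents[bg_id][other.lang] = other_id

    def counterparts(self, text_id):
        """Return {lang: text_id} for all counterparts of the given text."""
        text = self.texts[text_id]
        if text.lang == "bg":
            return dict(self.equivalents[text_id])
        bg_id = self.bg_of.get(text_id)
        if bg_id is None:
            return {}
        # Direct link to the Bulgarian text, indirect links to other languages.
        result = {"bg": bg_id}
        for lang, tid in self.equivalents[bg_id].items():
            if tid != text_id:
                result[lang] = tid
        return result
```

In this sketch, an English text linked to a Bulgarian text that also has a Romanian equivalent would reach the Romanian text only through the Bulgarian one, so no separate English-Romanian link has to be stored.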
The structure of each parallel corpus mirrors the structure of the core of the BulNC: the same classification based on style, domain and genre is adopted.
The parallel corpora are constantly being enlarged so that a greater variety of styles, thematic domains and genres may be attained. As of the end of January 2013, the overall size of the parallel corpora amounts to 4.2 billion words.
The largest parallel corpus is the Bulgarian-English parallel corpus, with approximately 260 million tokens per language. There are also six corpora of 200-250 million tokens per language, 14 corpora of 150-200 million and three corpora of 100-150 million. The remaining corpora are relatively small: 11 corpora of 1-15 million tokens and 15 of below 1 million. The smallest parallel corpus is the Bulgarian-Japanese one, with around 50,000 words per language.
Largest parallel corpora within BulNC.
Parallel corpus | Lang. code | Number of texts | Number of tokens |
Bulgarian-English | BG-EN | 113545 | 260681821 |
Bulgarian-Romanian | BG-RO | 114440 | 235859637 |
Bulgarian-French | BG-FR | 71935 | 231486663 |
Bulgarian-Greek | BG-EL | 113849 | 229749068 |
Bulgarian-Portuguese | BG-PT | 70697 | 211824204 |
Bulgarian-Italian | BG-IT | 71195 | 209083677 |
Bulgarian-Dutch | BG-NL | 70629 | 204309755 |
Bulgarian-Polish | BG-PL | 78055 | 197762449 |
Bulgarian-Czech | BG-CS | 72545 | 196769297 |
Bulgarian-German | BG-DE | 77502 | 194497872 |
Bulgarian-Spanish | BG-ES | 62879 | 191092782 |
Bulgarian-Danish | BG-DA | 71316 | 190843358 |
Bulgarian-Slovak | BG-SK | 71790 | 189752630 |
Bulgarian-Slovene | BG-SL | 71343 | 188776967 |
Bulgarian-Hungarian | BG-HU | 71618 | 183530929 |
Bulgarian-Swedish | BG-SV | 70115 | 180752058 |
Bulgarian-Lithuanian | BG-LT | 70858 | 170381570 |
Bulgarian-Latvian | BG-LV | 70015 | 167600804 |
Bulgarian-Maltese | BG-MT | 65218 | 163515445 |
Bulgarian-Estonian | BG-ET | 71558 | 160175247 |
Bulgarian-Finnish | BG-FI | 71247 | 156288741 |
Bulgarian-Turkish | BG-TR | 36655 | 13297328 |
Bulgarian-Irish | BG-GA | 2230 | 13287693 |
Bulgarian-Croatian | BG-HR | 33948 | 11950183 |
Bulgarian-Albanian | BG-SQ | 35787 | 9781443 |
Bulgarian-Macedonian | BG-MK | 35761 | 9542940 |
Bulgarian-Bosnian | BG-BS | 20736 | 6195646 |
Bulgarian-Russian | BG-RU | 211 | 3293243 |
Bulgarian-Hebrew | BG-HE | 446 | 2872765 |
Bulgarian-Arabic | BG-AR | 370 | 2446857 |
Bulgarian-Serbian | BG-SR | 865 | 1832323 |
Bulgarian-Norwegian | BG-NO | 173 | 1588561 |
Bulgarian-Icelandic | BG-IS | 41 | 762894 |
Bulgarian-Ukrainian | BG-UK | 40 | 744815 |
Bulgarian-Catalan | BG-CA | 26 | 640522 |
Bulgarian-Galician | BG-GL | 25 | 629272 |
Bulgarian-Kazakh | BG-KK | 29 | 486766 |
Bulgarian-Basque | BG-EU | 25 | 461080 |
Bulgarian-Chinese | BG-ZH | 34 | 229293 |
Bulgarian-Tajik | BG-TG | 16 | 160123 |
Bulgarian-Armenian | BG-HY | 16 | 139802 |
Bulgarian-Azerbaijani | BG-AZ | 16 | 137238 |
Bulgarian-Mongolian | BG-MN | 16 | 135076 |
Bulgarian-Kyrgyz | BG-KY | 16 | 135031 |
Bulgarian-Georgian | BG-KA | 16 | 128502 |
Bulgarian-Turkmen | BG-TK | 15 | 127430 |
Bulgarian-Japanese | BG-JA | 10 | 50194 |
Total | | 1,789,872 | 4,195,791,994 |
Size of the parallel corpora in number of texts and number of tokens.
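As a rough illustration of how the size bands quoted in the text relate to the token counts in the table, the following Python sketch assigns a corpus to a band. The band boundaries come from the text; treating the lower bound as inclusive, the "above 250M" label for the largest corpus and the "unclassified" fallback for sizes outside the quoted bands are assumptions made here. Only a handful of rows from the table are used as sample input.

```python
from collections import Counter

# (lower bound inclusive, upper bound exclusive, label), in tokens.
# Inclusive/exclusive handling at the boundaries is an assumption.
BANDS = [
    (250_000_000, float("inf"), "above 250M"),
    (200_000_000, 250_000_000, "200-250M"),
    (150_000_000, 200_000_000, "150-200M"),
    (100_000_000, 150_000_000, "100-150M"),
    (1_000_000, 15_000_000, "1-15M"),
    (0, 1_000_000, "below 1M"),
]


def size_band(tokens):
    """Return the label of the size band that contains the token count."""
    for low, high, label in BANDS:
        if low <= tokens < high:
            return label
    return "unclassified"  # e.g. sizes between 15 and 100 million tokens


# A few rows copied from the table above (language code -> number of tokens).
SAMPLE = {
    "BG-EN": 260681821,
    "BG-RO": 235859637,
    "BG-PL": 197762449,
    "BG-TR": 13297328,
    "BG-JA": 50194,
}

# Tally how many of the sample corpora fall into each band.
counts = Counter(size_band(n) for n in SAMPLE.values())
```

Applying `size_band` to every row of the table instead of the sample would give the number of corpora per band at the time the table was compiled.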
The largest parallel corpus within the BulNC is the Bulgarian-English parallel corpus, which comprises 260.7 million tokens for English and 263.1 million tokens for Bulgarian. The distribution of texts with respect to styles in the Bulgarian-English parallel corpus is shown in the diagram below.
The Bulgarian-English parallel corpus has been used for various research tasks. The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English parallel corpus. BulEnAC has been used in NLP applications for text alignment and machine translation.