Content

Bulgarian texts

The Bulgarian National Corpus (BulNC) was established in 2009 in response to the growing need for high-quality language resources for Bulgarian computational linguistics, natural language processing applications, theoretical linguistics, and related fields. Initially, the BulNC was created as a monolingual corpus of Bulgarian texts. Its enrichment with parallel corpora in recent years reflects the expansion of research interests in computational linguistics towards multilingual applications such as machine translation and information extraction from multilingual sources.

The core of the BulNC consists of Bulgarian texts – over 240 000 text samples amounting to 1.2 billion tokens.

Original Bulgarian texts make up 37.1% of the corpus and translated texts 40.5%; for the remaining 22.4%, the source and the direction of translation are not known.

The texts also differ in modality: they are predominantly written (97.35%), while spoken texts (2.65%) are limited to a few types: lectures, parliamentary proceedings and subtitles.

The majority of the texts (97.5%) are obtained from the internet either through automatic crawling or manual downloading, while the remaining 2.5% are provided by the authors or publishers.

The distribution of texts with respect to styles is presented in the figure below.

The parallel corpora within BulNC are collectively named Bul-X-Cor and comprise 47 corpora for different languages which have been compiled with Bulgarian as a pivot language. The parallel corpora vary in terms of size and diversity depending on the availability of parallel texts for the particular language pair. The parallel corpora cover English, German, French, most Slavic and Balkan languages, as well as many other European and non-European (both taxonomically and geographically) languages.

Organisation of the parallel corpora within BulNC

Each parallel corpus consists exclusively of texts that have a correspondence in Bulgarian – either the original or a translation. Both texts may be translations from a third language. The parallel corpora are part of the Bulgarian National Corpus (BulNC). Their structure, data format and description follow closely the model of the BulNC. The texts are supplied with detailed metadata, extracted automatically wherever possible, and manually elaborated, if necessary.

The main principle of organisation of the corpus is shown in the diagram below. Each text is stored only once. Each parallel equivalent is directly related to its Bulgarian counterpart and indirectly related to its counterparts in other languages, if they exist.

The main principle of organisation of the corpus.
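The organisation principle described above (each text stored once, direct links only to the Bulgarian counterpart, indirect links to other languages through it) can be illustrated with a minimal sketch. The class names, the identifiers such as "bg-001" and the lowercase language codes are hypothetical and are not taken from the BulNC implementation; the sketch only assumes that every direct link has a Bulgarian text on one side.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Text:
    """One stored text. Identifiers and language codes are illustrative."""
    text_id: str
    lang: str
    links: set[str] = field(default_factory=set)  # ids of directly linked texts


class ParallelStore:
    """Hypothetical store in which Bulgarian texts act as the pivot."""

    def __init__(self) -> None:
        self.texts: dict[str, Text] = {}

    def add(self, text_id: str, lang: str) -> Text:
        # Each text is stored only once: re-adding an id returns the stored object.
        return self.texts.setdefault(text_id, Text(text_id, lang))

    def link(self, bg_id: str, other_id: str) -> None:
        # Direct links always connect a Bulgarian text with a parallel equivalent.
        bg, other = self.texts[bg_id], self.texts[other_id]
        if bg.lang != "bg":
            raise ValueError("direct links must go through a Bulgarian text")
        bg.links.add(other_id)
        other.links.add(bg_id)

    def counterparts(self, text_id: str) -> dict[str, str]:
        """Return {language: text id} for all direct and indirect counterparts."""
        text = self.texts[text_id]
        # A Bulgarian text is its own pivot; any other text reaches
        # its counterparts through the Bulgarian text(s) it is linked to.
        pivots = [text] if text.lang == "bg" else [self.texts[i] for i in text.links]
        result: dict[str, str] = {}
        for pivot in pivots:
            if pivot.text_id != text_id:
                result[pivot.lang] = pivot.text_id
            for linked_id in pivot.links:
                if linked_id != text_id:
                    result[self.texts[linked_id].lang] = linked_id
        return result


store = ParallelStore()
store.add("bg-001", "bg")
store.add("en-001", "en")
store.add("fr-001", "fr")
store.link("bg-001", "en-001")
store.link("bg-001", "fr-001")

# The English text reaches the French one only through the Bulgarian pivot.
print(store.counterparts("en-001"))
```

In this sketch the English and French texts hold no reference to each other; their relation is recovered by following the link to the Bulgarian text, which is one way to avoid storing a link for every language pair.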

The structure of each parallel corpus mirrors that of the BulNC core: the same classification based on style, domain and genre is adopted.

Size of parallel corpora

The parallel corpora are being constantly enlarged so that a greater variety of styles, thematic domains and genres may be attained. Currently (end-January 2013), the overall size of the parallel corpora amounts to 4.2 billion words.

The largest parallel corpus is the Bulgarian-English one, with approximately 260 million tokens per language. There are a further six corpora of 200-250 million tokens per language and 14 corpora of 150-200 million. The remaining corpora are relatively small: 11 corpora of 1-15 million tokens and 15 corpora below 1 million. The smallest parallel corpus is the Bulgarian-Japanese one, with around 50,000 words per language.

Largest parallel corpora within BulNC.

Parallel corpus	Lang. code	Number of texts	Number of tokens
Bulgarian-English	BG-EN	113545	260681821
Bulgarian-Romanian	BG-RO	114440	235859637
Bulgarian-French	BG-FR	71935	231486663
Bulgarian-Greek	BG-EL	113849	229749068
Bulgarian-Portuguese	BG-PT	70697	211824204
Bulgarian-Italian	BG-IT	71195	209083677
Bulgarian-Dutch	BG-NL	70629	204309755
Bulgarian-Polish	BG-PL	78055	197762449
Bulgarian-Czech	BG-CS	72545	196769297
Bulgarian-German	BG-DE	77502	194497872
Bulgarian-Spanish	BG-ES	62879	191092782
Bulgarian-Danish	BG-DA	71316	190843358
Bulgarian-Slovak	BG-SK	71790	189752630
Bulgarian-Slovene	BG-SL	71343	188776967
Bulgarian-Hungarian	BG-HU	71618	183530929
Bulgarian-Swedish	BG-SV	70115	180752058
Bulgarian-Lithuanian	BG-LT	70858	170381570
Bulgarian-Latvian	BG-LV	70015	167600804
Bulgarian-Maltese	BG-MT	65218	163515445
Bulgarian-Estonian	BG-ET	71558	160175247
Bulgarian-Finnish	BG-FI	71247	156288741
Bulgarian-Turkish	BG-TR	36655	13297328
Bulgarian-Irish	BG-GA	2230	13287693
Bulgarian-Croatian	BG-HR	33948	11950183
Bulgarian-Albanian	BG-SQ	35787	9781443
Bulgarian-Macedonian	BG-MK	35761	9542940
Bulgarian-Bosnian	BG-BS	20736	6195646
Bulgarian-Russian	BG-RU	211	3293243
Bulgarian-Hebrew	BG-HE	446	2872765
Bulgarian-Arabic	BG-AR	370	2446857
Bulgarian-Serbian	BG-SR	865	1832323
Bulgarian-Norwegian	BG-NO	173	1588561
Bulgarian-Icelandic	BG-IS	41	762894
Bulgarian-Ukrainian	BG-UK	40	744815
Bulgarian-Catalan	BG-CA	26	640522
Bulgarian-Galician	BG-GL	25	629272
Bulgarian-Kazakh	BG-KK	29	486766
Bulgarian-Basque	BG-EU	25	461080
Bulgarian-Chinese	BG-ZH	34	229293
Bulgarian-Tajik	BG-TG	16	160123
Bulgarian-Armenian	BG-HY	16	139802
Bulgarian-Azerbaijani	BG-AZ	16	137238
Bulgarian-Mongolian	BG-MN	16	135076
Bulgarian-Kyrgyz	BG-KY	16	135031
Bulgarian-Georgian	BG-KA	16	128502
Bulgarian-Turkmen	BG-TK	15	127430
Bulgarian-Japanese	BG-JA	10	50194
Total		1,789,872	4,195,791,994

Size of the parallel corpora in number of texts and number of tokens.
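The grouping of corpora into size bands can be reproduced from the table with a few lines of code. The sketch below uses only a subset of rows, copied from the table, to stay short; the band boundaries and labels are an illustrative choice based on the ranges mentioned in the text, and the function names are hypothetical.

```python
from collections import Counter

# A subset of rows copied from the table: (language code, number of tokens).
ROWS = [
    ("BG-EN", 260_681_821),
    ("BG-RO", 235_859_637),
    ("BG-PL", 197_762_449),
    ("BG-FI", 156_288_741),
    ("BG-TR", 13_297_328),
    ("BG-NO", 1_588_561),
    ("BG-IS", 762_894),
    ("BG-JA", 50_194),
]

# Lower bounds of the size bands in tokens, largest first (illustrative).
BANDS = [
    (250_000_000, "over 250 million"),
    (200_000_000, "200-250 million"),
    (150_000_000, "150-200 million"),
    (15_000_000, "15-150 million"),
    (1_000_000, "1-15 million"),
    (0, "below 1 million"),
]


def size_band(tokens: int) -> str:
    """Return the label of the first band whose lower bound the count reaches."""
    for lower, label in BANDS:
        if tokens >= lower:
            return label
    raise ValueError("token count must be non-negative")


def band_counts(rows: list[tuple[str, int]]) -> Counter:
    """Count how many corpora fall into each size band."""
    return Counter(size_band(tokens) for _, tokens in rows)


for label, count in band_counts(ROWS).items():
    print(f"{label}: {count}")
```

Running the same functions over all rows of the table, rather than this subset, would give the number of corpora in each band.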

Bulgarian-English parallel corpus

The largest parallel corpus within BulNC is the Bulgarian-English parallel corpus, which comprises 260.7 million tokens for English and 263.1 million tokens for Bulgarian. The distribution of texts with respect to styles in the Bulgarian-English parallel corpus is shown in the diagram below.

Distribution of texts with respect to styles in the Bulgarian-English parallel corpus.

The Bulgarian-English parallel corpus has been used for various research tasks. The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English parallel corpus. BulEnAC has been used in NLP applications for text alignment and machine translation.