The Bulgarian National Corpus (BulNC) was established in 2009 in response to the growing need for high-quality language resources for Bulgarian computational linguistics, natural language processing applications, theoretical linguistics, etc. Initially, the BulNC was created as a monolingual corpus of Bulgarian texts. Its enrichment with parallel corpora in recent years reflects the expansion of research interests in computational linguistics towards multilingual applications – machine translation, information extraction from multilingual sources, etc.
The core of the BulNC consists of Bulgarian texts – over 240 000 text samples amounting to 1.2 billion tokens.
Original Bulgarian texts comprise 37.1% of the corpus and translated texts 40.5%; for the remaining 22.4% the source and the direction of translation are not known.
The texts also differ in modality: they are predominantly written (97.35%), with spoken texts (2.65%) limited to a few types – lectures, parliamentary proceedings and subtitles.
The majority of the texts (97.5%) are obtained from the internet either through automatic crawling or manual downloading, while the remaining 2.5% are provided by the authors or publishers.
The distribution of texts with respect to styles is presented in the figure below.
The parallel corpora within BulNC are collectively named Bul-X-Cor and comprise 47 corpora for different languages which have been compiled with Bulgarian as a pivot language. The parallel corpora vary in terms of size and diversity depending on the availability of parallel texts for the particular language pair. The parallel corpora cover English, German, French, most Slavic and Balkan languages, as well as many other European and non-European (both taxonomically and geographically) languages.
Organisation of the parallel corpora within BulNC
Each parallel corpus consists exclusively of texts that have a correspondence in Bulgarian – either the original or a translation. Both texts may be translations from a third language. The parallel corpora are part of the Bulgarian National Corpus (BulNC). Their structure, data format and description follow closely the model of the BulNC. The texts are supplied with detailed metadata, extracted automatically wherever possible, and manually elaborated, if necessary.
The main principle of organisation of the corpus is shown in the diagram below. Each text is stored only once. Each parallel equivalent is related directly to its Bulgarian counterpart and indirectly, through that counterpart, to its equivalents in other languages, if they exist.
The main principle of organisation of the corpus.
The structure of each parallel corpus reflects the structure of the core of BulNC – the same classification based on style, domain and genre, is adopted.
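The pivot organisation described above can be illustrated with a small sketch. It is not the BulNC implementation: the class names `Text` and `PivotCorpus`, the field names, the language codes and the sample identifiers are all hypothetical, chosen only to show how storing each text once and linking it to a Bulgarian counterpart makes the other-language equivalents reachable indirectly.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Text:
    """Hypothetical text record; the field names are illustrative only."""
    text_id: str
    lang: str        # language code, e.g. "bg", "en", "de"
    style: str = ""  # classification shared with the core corpus
    domain: str = ""
    genre: str = ""


@dataclass
class PivotCorpus:
    """Each text is stored once; links always go through a Bulgarian text."""
    texts: dict = field(default_factory=dict)  # text_id -> Text
    links: dict = field(default_factory=dict)  # Bulgarian text_id -> {lang: text_id}

    def add_text(self, text: Text) -> None:
        if text.text_id in self.texts:
            raise ValueError(f"text {text.text_id!r} is already stored")
        self.texts[text.text_id] = text

    def link(self, bg_id: str, other_id: str) -> None:
        """Record a direct link between a Bulgarian text and its equivalent."""
        bg, other = self.texts[bg_id], self.texts[other_id]
        if bg.lang != "bg" or other.lang == "bg":
            raise ValueError("links must pair a Bulgarian text with a non-Bulgarian one")
        self.links.setdefault(bg_id, {})[other.lang] = other_id

    def counterparts(self, text_id: str) -> dict:
        """Return {lang: text_id} for every equivalent reachable via the pivot."""
        text = self.texts[text_id]
        if text.lang == "bg":
            bg_id = text_id
        else:
            # Find the Bulgarian text this one is directly linked to.
            bg_id = next(
                (b for b, by_lang in self.links.items()
                 if by_lang.get(text.lang) == text_id),
                None,
            )
            if bg_id is None:
                return {}
        result = {"bg": bg_id, **self.links.get(bg_id, {})}
        result.pop(text.lang, None)  # a text is not its own counterpart
        return result


corpus = PivotCorpus()
for tid, lang in [("bg1", "bg"), ("en1", "en"), ("de1", "de")]:
    corpus.add_text(Text(tid, lang, style="fiction"))
corpus.link("bg1", "en1")
corpus.link("bg1", "de1")
print(corpus.counterparts("en1"))
```

In this sketch the English and German texts are never linked to each other; the German equivalent of `en1` is found only because both are linked to `bg1`.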
Size of parallel corpora
The parallel corpora are constantly being enlarged to cover a greater variety of styles, thematic domains and genres. Currently (as of the end of January 2013), the overall size of the parallel corpora amounts to 4.2 billion words.
The largest parallel corpus is the Bulgarian-English parallel corpus, with approximately 260 million tokens per language. There are also six corpora of 200-250 million tokens per language, 14 corpora of 150-200 million, and three corpora of 100-150 million. The remaining corpora are relatively small: 11 corpora of 1-15 million tokens and 15 of under 1 million. The smallest parallel corpus is the Bulgarian-Japanese one, with around 50,000 words per language.
Largest parallel corpora within BulNC.
Size of the parallel corpora in number of texts and number of tokens.
Bulgarian-English parallel corpus
The largest parallel corpus within BulNC is the Bulgarian-English parallel corpus, which comprises 260.7 million tokens for English and 263.1 million tokens for Bulgarian. The distribution of texts with respect to styles in the Bulgarian-English parallel corpus is shown in the diagram below.
Distribution of texts with respect to styles in the Bulgarian-English parallel corpus.
The Bulgarian-English parallel corpus has been used for various research tasks. The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English parallel corpus. BulEnAC has been used in NLP applications for text alignment and machine translation.