The Bulgarian National Corpus (BulNC) is being created at the Institute for Bulgarian Language “Prof. L. Andreychin” by researchers from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. Its core incorporates several electronic corpora developed between 2001 and 2009 for the purposes of the two departments, and it has been substantially expanded in the years since. The corpus reflects the state of the Bulgarian language (mainly in its written form) from the middle of the 20th century (1945) until the present.
The enlargement of the BulNC has involved not only the collection of Bulgarian texts, but also the compilation of parallel corpora with Bulgarian as a pivot language. This means that every text in another language that is added to the corpus must have a Bulgarian counterpart in the Bulgarian part of the corpus, which constitutes its core.
Currently (2014), the corpus core consists of approximately 1.2 billion words in more than 240,000 texts. So far, 47 foreign languages have been included, totalling about 4.2 billion words. Thus, the overall size of the corpus is about 5.4 billion words.
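The size figures above can be checked with a few lines of arithmetic. The sketch below uses only the approximate numbers quoted in the text; the constant and function names are illustrative and do not come from the BulNC project.

```python
# Approximate figures quoted in the text (2014), in billions of words.
BULGARIAN_CORE_BN = 1.2
FOREIGN_PARTS_BN = 4.2


def total_size_bn(core_bn: float, foreign_bn: float) -> float:
    """Return the overall size in billions of words, rounded to one decimal."""
    return round(core_bn + foreign_bn, 1)


overall_bn = total_size_bn(BULGARIAN_CORE_BN, FOREIGN_PARTS_BN)

# Share of the whole corpus taken up by the Bulgarian core.
core_share = BULGARIAN_CORE_BN / overall_bn

print(f"overall: {overall_bn} billion words; Bulgarian core share: {core_share:.0%}")
```

On these rounded figures, the Bulgarian core makes up roughly a fifth of the whole corpus.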
The corpus is supplied with three levels of annotation:
- a detailed metadata description – information about the author, date of creation, date of publication, type, genre, domain, etc.;
- monolingual annotation – tokenisation, sentence splitting, POS tagging, lemmatisation, word sense annotation (with senses assigned from the Bulgarian WordNet);
- multilingual annotation – alignment at different levels, currently sentence and clause level for parts of the corpus.
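One way to picture the three annotation levels listed above is as nested records. The sketch below is only an illustration: all class and field names are hypothetical, the part-of-speech labels are generic placeholders rather than tags from the BulNC tagset, and the actual storage format of the corpus is not described in this text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextMetadata:
    """Level 1: metadata description of a text (hypothetical field names)."""
    author: Optional[str] = None
    date_created: Optional[str] = None
    date_published: Optional[str] = None
    text_type: Optional[str] = None
    genre: Optional[str] = None
    domain: Optional[str] = None


@dataclass
class Token:
    """Level 2: monolingual annotation of a single token."""
    form: str
    pos: str                        # placeholder label, not a BulNC tag
    lemma: str
    sense_id: Optional[str] = None  # would point to a Bulgarian WordNet sense


@dataclass
class Sentence:
    sent_id: str
    tokens: List[Token] = field(default_factory=list)


@dataclass
class AlignedPair:
    """Level 3: a sentence-level link from a Bulgarian sentence to a counterpart."""
    bg_sent_id: str
    other_lang: str
    other_sent_id: str


# A tiny made-up example: one Bulgarian sentence linked to an English one.
sentence = Sentence("bg-1", [Token("Котката", "NOUN", "котка"),
                             Token("спи", "VERB", "спя")])
link = AlignedPair(bg_sent_id="bg-1", other_lang="en", other_sent_id="en-1")

lemmas = [token.lemma for token in sentence.tokens]
print(lemmas)
```

Because Bulgarian is the pivot language, every alignment record in this sketch is anchored on a Bulgarian sentence identifier.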
A special corpus search system has been developed, which allows complex queries to be performed.
The tagset used in the annotation of the BulNC is available here.
BulNC’s search system
Type: written and spoken language; multilingual; general, with a number of specialised subcorpora; supplied with metadata description and multi-level linguistic annotation.
Size: more than 240,000 text samples distributed in 9 categories. Overall size: approximately 5.4 billion words.
Application: The Bulgarian National Corpus supports applications in various linguistic areas: computational linguistics; lexicography; theoretical studies of specific linguistic phenomena; observation of the characteristics of individual language domains; extraction of example sentences for teaching Bulgarian, etc.
Annotation: monolingual annotation: tokenisation, sentence splitting, POS annotation and disambiguation, and lemmatisation; word sense annotation with senses assigned from the Bulgarian WordNet; multilingual annotation: sentence and clause alignment for parts of the corpus; detailed metadata description.
Access: Free online access through the web search system of the Bulgarian National Corpus:
- free online search restricted to 30 hits for users without registration;
- full access to the results for registered users;
- free download of copyright-free texts for registered users.
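The access rules above amount to a simple cap on the number of results shown to users without registration. The following is a minimal sketch of that rule only; the function name, the constant name and the representation of hits as strings are assumptions made for illustration and say nothing about how the BulNC search system is actually implemented.

```python
from typing import List, Sequence

GUEST_HIT_LIMIT = 30  # limit stated in the text for users without registration


def visible_hits(hits: Sequence[str], registered: bool) -> List[str]:
    """Return the hits a user may see: all of them if registered, else the first 30."""
    if registered:
        return list(hits)
    return list(hits[:GUEST_HIT_LIMIT])


# A made-up result list of 100 hits.
all_hits = [f"hit {i}" for i in range(1, 101)]

guest_view = visible_hits(all_hits, registered=False)
member_view = visible_hits(all_hits, registered=True)

print(len(guest_view), len(member_view))
```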