"Brown" corpus of Bulgarian

The Bulgarian "Brown" Corpus was compiled following the methodology developed at Brown University (Providence, Rhode Island, USA) and used to compile the well-known Brown Corpus of Standard American English. The Bulgarian "Brown" Corpus consists of 500 text samples distributed across 15 categories from two text types: fiction and informative prose. Sample length is set at 2,000 words; the exact number of words varies because the adopted methodology respects sentence boundaries. The corpus amounts to 1,001,286 words. The samples are excerpts from texts created or first published between 1990 and 2005, most of them dating from after 2000.
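The sentence-boundary rule explains why samples are not exactly 2,000 words long: 1,001,286 words over 500 samples gives an average of about 2,003 words per sample. The sketch below illustrates one way such a rule could work. It is not the compilers' actual procedure: the function name `cut_sample`, the whitespace-based word count, and the choice to stop at the first sentence end at or after the target are all assumptions made for illustration.

```python
def cut_sample(sentences, target_words=2000):
    """Collect whole sentences until the word count reaches the target.

    Assumption: words are counted by splitting on whitespace, and the
    sample ends at the first sentence boundary at or after the target.
    """
    sample = []
    word_count = 0
    for sentence in sentences:
        if word_count >= target_words:
            break
        # Sentences are never split, so the total may overshoot the target.
        sample.append(sentence)
        word_count += len(sentence.split())
    return sample, word_count
```

Called on a list of already segmented sentences, the function returns the selected sentences together with their total word count, which will usually be slightly above the target.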

The corpus is supplied with the relevant documentation. As a final step, the texts were checked for Cyrillic letters wrongly replaced by Latin characters, as well as for spelling and punctuation errors.

The first version of the corpus was compiled in 2001-2002. Some of the principles underlying the Brown Corpus, such as originality of the texts and recentness of creation, had to be disregarded in order to provide sufficient coverage of all categories. The experience gained in creating the first version, along with the significant increase in electronically available publications in the period 2002-2005, made possible and greatly facilitated the compilation of the second version of the corpus.
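The check for Latin characters standing in for Cyrillic letters can be approximated by flagging words that mix the two scripts, since look-alike letters such as Latin "o" and Cyrillic "о" are visually indistinguishable. The following is a minimal sketch under that assumption; the source does not describe how the check was actually carried out, and the name `mixed_script_words` is hypothetical.

```python
import re

# Assumption: the basic Cyrillic block and ASCII Latin letters are enough
# to illustrate the idea; a real check might cover more ranges.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")
WORD = re.compile(r"[^\W\d_]+")  # runs of letters only


def mixed_script_words(text):
    """Return words containing both Cyrillic and Latin letters."""
    return [
        word
        for word in WORD.findall(text)
        if CYRILLIC.search(word) and LATIN.search(word)
    ]
```

Words written entirely in one script, including legitimate Latin-script names such as "Brown", are not flagged. A word typed wholly in look-alike Latin letters would also pass unnoticed, so this heuristic catches only partial substitutions.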