
Bulgarian Brown Corpus



The Bulgarian Brown Corpus is a general, static, representative sample corpus of Bulgarian compiled at the Department of Computational Linguistics at the Institute for Bulgarian Language. It follows the methodology developed at Brown University (Providence, Rhode Island, USA) and applied in compiling the well-known Brown Corpus (Brown University Standard Corpus of Present-Day American English). It illustrates linguistic usage in informative and fiction text types, divided into categories according to stylistic, thematic and/or genre principles. To ensure that the Bulgarian Brown Corpus is representative, balanced and illustrative, we rely on a preliminary structural model and a taxonomy of text categories, for which we sample appropriate texts.

Language: Bulgarian

Type: general representative monolingual sample text corpus

Composition: The Bulgarian Brown Corpus includes 500 texts distributed across 15 domains in 2 text categories: fiction and non-fiction. Each text is approximately 2000 words long; the exact number of words varies because samples end at sentence boundaries. The total volume of the corpus is 1,001,286 words. Corpus samples are excerpts of texts published in the period 1990-2005, predominantly after 2000.

History: The first version of the corpus was compiled in 2001-2002. It became apparent that some domains were not well covered, so some of the original Brown Corpus principles were abandoned (e.g., some texts were not original but translations into Bulgarian, and some texts were pre-1990). The experience gathered while compiling the first version, together with the significant increase in electronic publications in Bulgarian, led to the compilation of the second version in the period 2002-2005.

Annotation: The corpus is documented, normalised and edited.

Use Terms:

Download:

➥ Bulgarian Brown Corpus: texts and metadata | ➥ Metadata ONLY (.xlsx format)


PARTICIPANTS

The following team has taken part in the compilation of the Bulgarian Brown Corpus:

  • Prof. Dr. Svetla Koeva (head of the project)
  • Assist. Prof. Dr. Svetlozara Leseva, Dr. Ivelina Stoyanova, Assoc. Prof. Dr. Ekaterina Tarpomanova, Borislav Rizov, Nikola Obreshkov (compilation)

➥ Specific features of the Bulgarian Brown Corpus

➥ Main criteria of corpus compilation

➥ Classification

➥ Description of corpus samples


Specific features of the Bulgarian Brown Corpus

Representativeness is ensured using a random selection of texts distributed in homogeneous groups.

Each corpus sample in the Bulgarian Brown Corpus is an excerpt of approximately 2000 words from a text. The term ‘corpus sample’ distinguishes the part included in the corpus from the whole text. The Bulgarian corpus, following the model of the original Brown Corpus, consists of 500 corpus samples with a total volume of 1,001,286 words. Despite the aim of approximately 2000 words per sample, 136 samples in the corpus are shorter because of their genre.

Unlike the original Brown Corpus (Brown University Standard Corpus of Present-Day American English), which is built from texts published within one calendar year (1961) in order to reflect a relatively static state of the language, the Bulgarian Brown Corpus includes texts created or first published over a relatively long period, 1990 to 2005, with most of the texts published after 2000. There are two reasons for this. First, many of the texts were collected in electronic form from the internet, where their date (year) of publication cannot be precisely determined. Second, Bulgarian sources did not sufficiently cover the categories of the original Brown Corpus, so the categories could not be filled with texts published in a shorter period.




Main criteria of corpus compilation (in priority order)

  1. Texts need to be original, not translations.
  2. Texts need to be recent – published after 1990, preferably after 2000.
  3. Samples need to adhere to the categories and subcategories of the original Brown Corpus, with their prescribed number of corpus samples.

    Exceptions:

    • Subcategories in category F are redistributed.
    • Within categories A-C the division between daily and weekly editions has been disregarded.
  4. The text source needs to be available (valid as of the date when the sample was added to the corpus).

    Exceptions: 20 corpus samples have no information about the source.

  5. The corpus sample was also included in the first version of the corpus.

    Exceptions: 385 corpus samples were replaced in order to satisfy the first four criteria.

  6. Each corpus sample needs to be written by a single author.

    Exceptions: 46 corpus samples are authored by more than one person; 70 have no known author.

  7. Each corpus sample needs to be excerpted from a single text.

    Exceptions: 104 corpus samples include excerpts from more than one text (belonging to short genres).

  8. Each sample needs to contain at least 2000 words (ending at the first sentence boundary after the 2000th word).

    Exceptions: 136 samples are shorter than 2000 words:

    • 57 samples contain 1990-1999 words;
    • 69 samples contain 1900-1989 words;
    • 10 samples contain 1900 words or fewer.
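
The length rule in criterion 8 can be illustrated with a minimal sketch. It is not the compilers' actual procedure: the function name `cut_sample`, whitespace tokenisation, and the punctuation-based sentence-end test are all assumptions made for illustration only.

```python
# Hypothetical helper: cut a text at the first sentence end at or after
# the min_words-th word. Tokenisation and sentence detection are
# simplifying assumptions, not the rules used for the corpus.
CLOSERS = "\"'”»)]"            # closing quotes/brackets that may follow the end mark
ENDERS = (".", "!", "?", "…")  # characters assumed to end a sentence


def ends_sentence(token: str) -> bool:
    """True if the token ends with sentence-final punctuation."""
    return token.rstrip(CLOSERS).endswith(ENDERS)


def cut_sample(text: str, min_words: int = 2000) -> list[str]:
    """Return the tokens up to the first sentence end at or after word min_words."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        # i + 1 is the 1-based word count of the sample so far.
        if i + 1 >= min_words and ends_sentence(token):
            return tokens[: i + 1]
    # The text is too short or has no later sentence end: keep it whole.
    return tokens


demo = "Първо изречение. Второ изречение тук. Трето."
print(len(cut_sample(demo, min_words=3)))  # prints 5
```

In this sketch a text shorter than `min_words` is returned whole, which loosely corresponds to the short-genre exceptions listed above.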



Classification

The classification is based on the following features:

  • Type of text – informative or fiction;
  • Category (based on the text type, the domain and/or the genre);
  • Subcategory (based on the category, the sample length and the source);
  • Genre (it has only a descriptive function, not a classificatory one).

Table. Classification of the Bulgarian Brown Corpus.

| Category | Subcategory | Number of samples |
|---|---|---|
| I. Informative texts | | |
| A. Mass media: News | Politics | 14 |
| | Sports | 7 |
| | Society | 3 |
| | News | 9 |
| | Economics | 4 |
| | Culture | 7 |
| | total | 44 |
| B. Mass media: Editorials and analyses | Institutional | 10 |
| | Personal | 10 |
| | Letters | 7 |
| | total | 27 |
| C. Mass media: Reviews | Reviews | 17 |
| | total | 17 |
| D. Religion | Books | 7 |
| | Mass media | 6 |
| | Short stories | 4 |
| | total | 17 |
| E. Leisure | Books | 2 |
| | Mass media | 34 |
| | total | 36 |
| F. Popular | Books | 10 |
| | Mass media | 38 |
| | total | 48 |
| G. Documentaries | Books | 38 |
| | Mass media | 37 |
| | total | 75 |
| H. Administrative documents | Government | 24 |
| | Organisations | 2 |
| | Industrial reports | 2 |
| | Education | 1 |
| | Industrial periodicals | 1 |
| | total | 30 |
| J. Science | Natural Sciences | 12 |
| | Medicine | 5 |
| | Mathematics | 4 |
| | Social Sciences | 14 |
| | Political Science, Law, Education | 15 |
| | Humanities | 18 |
| | Technology | 12 |
| | total | 80 |
| Total of informative texts | | 374 |
| II. Fiction | | |
| K. Classical literature | Novels | 20 |
| | Short stories | 9 |
| | total | 29 |
| L. Detective literature | Novels | 20 |
| | Short stories | 4 |
| | total | 24 |
| M. Science fiction | Novels | 3 |
| | Short stories | 3 |
| | total | 6 |
| N. Adventure literature | Novels | 15 |
| | Short stories | 14 |
| | total | 29 |
| P. Romance literature | Novels | 14 |
| | Short stories | 15 |
| | total | 29 |
| R. Humour literature | Novels | 3 |
| | Essays, etc. | 6 |
| | total | 9 |
| Total of fiction texts | | 126 |
| TOTAL | | 500 |
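
The table's arithmetic can be cross-checked with a short sketch. The subcategory counts below are copied from the table; the variable and function names are illustrative assumptions. Summing the subcategory counts reproduces the stated totals of 374 informative samples, 126 fiction samples, and 500 samples overall.

```python
# Subcategory counts per category letter, copied from the classification table.
INFORMATIVE = {
    "A": [14, 7, 3, 9, 4, 7],
    "B": [10, 10, 7],
    "C": [17],
    "D": [7, 6, 4],
    "E": [2, 34],
    "F": [10, 38],
    "G": [38, 37],
    "H": [24, 2, 2, 1, 1],
    "J": [12, 5, 4, 14, 15, 18, 12],
}
FICTION = {
    "K": [20, 9],
    "L": [20, 4],
    "M": [3, 3],
    "N": [15, 14],
    "P": [14, 15],
    "R": [3, 6],
}


def category_totals(groups: dict[str, list[int]]) -> dict[str, int]:
    """Sum the subcategory counts of every category."""
    return {category: sum(counts) for category, counts in groups.items()}


informative_total = sum(category_totals(INFORMATIVE).values())
fiction_total = sum(category_totals(FICTION).values())
print(informative_total, fiction_total, informative_total + fiction_total)
# prints: 374 126 500
```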

Extended categories due to changes in the domain distributions:

  1. Detective literature – this category also includes police / crime / action novels and short stories.
  2. Adventure literature – as there were no typical adventure texts in Bulgarian, we replaced these with fantasy novels and short stories (adventure fiction with fantastic elements) as well as psychological novels and short stories with adventure elements.



Description of corpus samples

General information

The description of each corpus sample includes general information about the text and the category to which it belongs.

  1. File name;
  2. File path;
  3. Old file name and path – in case the file was also included in the first version of the corpus;
  4. Author information – number of authors, names, unknown author;
  5. Text information – one or more texts, title;
  6. Form of the text – written, oral;
  7. Number of words;
  8. Date of adding the text to the corpus – source data are valid on that date;
  9. Date (year) of creating the text or its first publication;
  10. Date (year) of publication of the current version of the text;
  11. Information about the source;
  12. Additional notes.
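
The twelve fields above could be modelled as a simple record type. The sketch below is hypothetical: the class name, field names, and example values are assumptions made for illustration and do not reflect the column names of the actual metadata file.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SampleMetadata:
    """Hypothetical record mirroring the twelve descriptive fields."""
    file_name: str                        # 1. file name
    file_path: str                        # 2. file path
    word_count: int                       # 7. number of words
    date_added: str                       # 8. date added; source data valid on that date
    old_file_path: Optional[str] = None   # 3. only if the file was in version 1
    authors: list[str] = field(default_factory=list)  # 4. empty list = unknown author
    titles: list[str] = field(default_factory=list)   # 5. one or more texts
    form: str = "written"                 # 6. "written" or "oral"
    year_created: Optional[int] = None    # 9. creation or first publication
    year_published: Optional[int] = None  # 10. publication of the current version
    source: Optional[str] = None          # 11. information about the source
    notes: str = ""                       # 12. additional notes

    @property
    def author_status(self) -> str:
        """Classify the sample as 'unknown', 'single' or 'multiple' authorship."""
        if not self.authors:
            return "unknown"
        return "single" if len(self.authors) == 1 else "multiple"

    @property
    def is_multi_text(self) -> bool:
        """True if the sample combines excerpts from more than one text."""
        return len(self.titles) > 1


# Invented example values, for illustration only.
record = SampleMetadata(
    file_name="A01.txt",
    file_path="informative/A/A01.txt",
    word_count=2004,
    date_added="2004-05-17",
    authors=["Example Author"],
    titles=["Example title"],
)
print(record.author_status, record.is_multi_text)
```

The two derived properties correspond to criteria 6 and 7 of the compilation criteria (single author, single text), so exceptions could be counted directly from such records.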

Full description

The full description of the Bulgarian Brown Corpus (in Bulgarian) can be downloaded as an MS Excel file.





The resource provides search capabilities for linguistic research, educational and other purposes.

Parts of the Bulgarian Brown Corpus were used in the creation of BulSemCor and BulPosCor.

Studying the Bulgarian Brown Corpus raises questions, and provides an environment for theoretical and practical work, on problems that are generally underrepresented in scientific research. One example is assessing how adequate the applied model is for modern purposes: it was created at Brown University in 1962-1963, based mainly on observations of printed American publications rather than on statistical analyses. This problem poses a number of scientific tasks, such as determining the extent to which the text selection criteria apply to texts in Bulgarian, and the extent to which printed and electronic texts fit into the same categories.

An interesting task is also the assessment of the relevance of the model for 2005 (the year of creation of the Bulgarian Brown Corpus, version 2). To this day, the question of how applicable statistical methods (based on quantitative analysis) are to creating a methodology for building corpora remains open.

When using the Bulgarian Brown Corpus in your research, please cite any of the following publications:

Koeva, S., D. Blagoeva (eds.). Ezikovi resursi i tehnologii za balgarski ezik. Sofia: BAS Academy Press, 2014, 310 p. ISBN: 978-954-322-797-6.
Ivelina Stoyanova, Svetla Koeva, Svetlozara Lesseva. Applying and analysing Brown corpus model for Bulgarian. Presentation at The Third Inter-Varietal Applied Corpus Studies (IVACS) group International Conference on “LANGUAGE AT THE INTERFACE” 23rd – 24th June 2006, Nottingham, UK.
Koeva, S., S. Leseva, I. Stoyanova, E. Tarpomanova, M. Todorova. Bulgarian Tagged Corpora. – In: Proceedings of the Fifth International Conference Formal Approaches to South Slavic and Balkan Languages, 2006, pp. 78 – 86.
Koeva, S., S. Leseva, M. Todorova. Bulgarian Sense Tagged Corpus. – In: Proceedings of the 5th SALTMIL Workshop on Minority Languages: Strategies for Developing Machine Translation for Minority Languages, 2006, pp. 79 – 87.