The Bulgarian Brown Corpus is a general static representative sample corpus of Bulgarian compiled at the Department of Computational Linguistics at the Institute for Bulgarian Language. It follows the methodology presented by Brown University, Providence, Rhode Island, USA and applied in the compilation of the famous Brown Corpus (Brown University Standard Corpus of Present-Day American English). It illustrates the linguistic usage of informative or fictional text types divided into categories according to stylistic, thematic and/or genre principles. In order to ensure good representativeness, balance and illustrativeness of the Bulgarian Brown Corpus, we rely on a preliminary structural model and taxonomy of text categories for which we sample appropriate texts.
Language: Bulgarian
Type: general representative monolingual sample text corpus
Composition: The Bulgarian Brown Corpus includes 500 texts distributed across 15 domains in 2 text categories: fiction and non-fiction. The length of each text is approximately 2000 words. The number of words varies because samples were cut at sentence boundaries. The total volume of the corpus is 1,001,286 words. Corpus samples are excerpts of texts published in the period 1990-2005, predominantly after 2000.
History: The first version of the corpus was compiled in 2001-2002. It became apparent that some domains were not well covered, so some of the original Brown Corpus principles were abandoned (e.g., some texts were not original but translations into Bulgarian, and some texts were pre-1990). The experience gathered while compiling the first version, together with the significant increase in electronic publications in Bulgarian, led to the compilation of the second version in the period 2002-2005.
Annotation: The corpus is documented, normalised and edited
Use Terms:
- Free access to search online.
- Free download under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Download:
➥ Bulgarian Brown Corpus: texts and metadata | ➥ Metadata ONLY (.xlsx format)
PROJECTS
- BulNet – a Lexical-Semantic Network of Bulgarian (nationally funded project: 2005–2007; 2008–2010)
- Bulgarian National Corpus (funded within EU framework and national programmes: 2010 – 2013)
- Electronic Language Resources and Processing Tools (BulNet and FrameNet) (nationally funded project: 2011 – 2013)
PARTICIPANTS
The following team has taken part in the compilation of the Bulgarian Brown Corpus:
- Prof. Dr. Svetla Koeva (head of the project)
- Assist. Prof. Dr Svetlozara Leseva, Dr Ivelina Stoyanova, Assoc. Prof. Dr. Ekaterina Tarpomanova, Borislav Rizov, Nikola Obreshkov (compilation)
➥ Specific features of the Bulgarian Brown Corpus
➥ Main criteria of corpus compilation
➥ Description of corpus samples
Specific features of the Bulgarian Brown Corpus
Representativeness is ensured using a random selection of texts distributed in homogeneous groups.
Each corpus sample in the Bulgarian Brown Corpus represents an excerpt from a text with a length of approximately 2000 words. The term ‘corpus sample’ distinguishes the whole text from the part of it included in the corpus. The Bulgarian corpus, following the model of the original Brown corpus, consists of 500 corpus samples with a total volume of 1,001,286 words. Despite striving to meet the requirement of approximately 2000 words, 136 samples in the corpus have a smaller size due to their genre.
Unlike the original Brown Corpus (Brown University Standard Corpus of Present-Day American English), which is built from texts published within one calendar year (1961) in order to reflect a relatively static state of the language, the Bulgarian Brown Corpus includes texts created or first published over a relatively long period, 1990 to 2005, with most of the texts published after 2000. There are two reasons for this. First, many of the texts were collected in electronic form from the internet, where the date (year) of publication cannot be precisely determined. Second, Bulgarian sources did not sufficiently cover the categories of the original Brown Corpus, so these categories could not be filled with texts published within a shorter period.
Main criteria of corpus compilation (in priority order)
- Texts need to be original, not translations.
- Texts need to be recent – published after 1990, preferably after 2000.
- To adhere to the categories and subcategories of the original Brown Corpus with their prescribed number of corpus samples.
Exceptions:
- Subcategories in category F are redistributed.
- Within categories A-C the division between daily and weekly editions has been disregarded.
- To have the text source available (valid by the date when the sample is added to the corpus).
Exceptions: 20 corpus samples have no information about the source.
- The corpus sample is also included in the first version of the corpus.
Exceptions: 385 corpus samples were replaced in order to satisfy the first four criteria.
- Each corpus sample to be written by a single author.
Exceptions: 46 corpus samples are authored by more than one person; 70 have no known author.
- Each corpus sample to be excerpted from a single text.
Exceptions: 104 corpus samples include excerpts from more than one text (belonging to short genres).
- The number of words in each sample needs to be at least 2000 (ending at the first end of a sentence after the 2000th word).
Exceptions: 136 samples are shorter than 2000 words:
- 57 samples contain 1990-1999 words;
- 69 samples contain 1900-1989 words;
- 10 samples contain 1900 words or fewer.
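The length criterion above (cut at the first sentence end at or after the 2000th word) can be illustrated with a minimal Python sketch. This is not the tool used to compile the corpus: the function name `cut_sample`, the whitespace tokenisation and the punctuation-based sentence-end pattern are all assumptions made for illustration, and a real pipeline would need proper Bulgarian sentence splitting (for example, to handle abbreviations).

```python
import re

# Assumption: a token ends a sentence if it ends in '.', '!', '?' or '…',
# optionally followed by closing quotes or brackets. Abbreviations such as
# "Dr." would be misdetected by this simple pattern.
SENTENCE_END = re.compile(r'[.!?…]+["”»)\]]*$')


def cut_sample(text: str, min_words: int = 2000) -> str:
    """Return the shortest prefix of `text` that has at least `min_words`
    whitespace-separated words and ends at a sentence boundary."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        # i is 0-based, so i + 1 words have been read so far.
        if i + 1 >= min_words and SENTENCE_END.search(token):
            return " ".join(tokens[: i + 1])
    # Text is shorter than min_words, or no sentence end follows the
    # threshold: keep the whole text (this is how short samples arise).
    return " ".join(tokens)
```

Under these assumptions a sample can only be shorter than 2000 words when the underlying text itself is shorter, which matches the genre-related exceptions listed above.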
Classification
The classification is based on the following features:
- Type of text – informative or fiction;
- Category (based on the text type, the domain and/or the genre);
- Subcategory (based on the category, the sample length and the source);
- Genre (it has no classificatory function, only a descriptive one).
Table. Classification of the Bulgarian Brown Corpus.
Category | Subcategory | Number of samples |
I. Informative texts | | |
A. Mass media: News | Politics | 14 |
| Sports | 7 |
| Society | 3 |
| News | 9 |
| Economics | 4 |
| Culture | 7 |
| total | 44 |
B. Mass media: Editorials and analyses | Institutional | 10 |
| Personal | 10 |
| Letters | 7 |
| total | 27 |
C. Mass media: Reviews | Reviews | 17 |
| total | 17 |
D. Religion | Books | 7 |
| Mass media | 6 |
| Short stories | 4 |
| total | 17 |
E. Leisure | Books | 2 |
| Mass media | 34 |
| total | 36 |
F. Popular | Books | 10 |
| Mass media | 38 |
| total | 48 |
G. Documentaries | Books | 38 |
| Mass media | 37 |
| total | 75 |
H. Administrative documents | Government | 24 |
| Organisations | 2 |
| Industrial reports | 2 |
| Education | 1 |
| Industrial periodicals | 1 |
| total | 30 |
J. Science | Natural Sciences | 12 |
| Medicine | 5 |
| Mathematics | 4 |
| Social Sciences | 14 |
| Politology, Law, Education | 15 |
| Humanities | 18 |
| Technology | 12 |
| total | 80 |
Total of informative texts | | 374 |
II. Fiction | | |
K. Classical literature | Novels | 20 |
| Short stories | 9 |
| total | 29 |
L. Detective literature | Novels | 20 |
| Short stories | 4 |
| total | 24 |
M. Science fiction | Novels | 3 |
| Short stories | 3 |
| total | 6 |
N. Adventure literature | Novels | 15 |
| Short stories | 14 |
| total | 29 |
P. Romance literature | Novels | 14 |
| Short stories | 15 |
| total | 29 |
R. Humour literature | Novels | 3 |
| Essays, etc. | 6 |
| total | 9 |
Total of fiction texts | | 126 |
TOTAL | | 500 |
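As a quick consistency check, the subcategory counts in the table can be summed per category and per text type. The short Python sketch below is only an illustration: the dictionary layout and the names `COUNTS`, `INFORMATIVE` and `FICTION` are assumptions, while the numbers are the subcategory counts listed in the table. The sums reproduce the stated totals of 374 informative samples, 126 fiction samples and 500 samples overall.

```python
# Subcategory counts per category, in the order listed in the table.
COUNTS = {
    "A": [14, 7, 3, 9, 4, 7],
    "B": [10, 10, 7],
    "C": [17],
    "D": [7, 6, 4],
    "E": [2, 34],
    "F": [10, 38],
    "G": [38, 37],
    "H": [24, 2, 2, 1, 1],
    "J": [12, 5, 4, 14, 15, 18, 12],
    "K": [20, 9],
    "L": [20, 4],
    "M": [3, 3],
    "N": [15, 14],
    "P": [14, 15],
    "R": [3, 6],
}
INFORMATIVE = "ABCDEFGHJ"  # categories of type I (informative texts)
FICTION = "KLMNPR"         # categories of type II (fiction)


def category_totals(counts):
    """Sum the subcategory counts of every category."""
    return {category: sum(values) for category, values in counts.items()}


totals = category_totals(COUNTS)
informative_total = sum(totals[c] for c in INFORMATIVE)
fiction_total = sum(totals[c] for c in FICTION)

print(totals)
print(informative_total, fiction_total, informative_total + fiction_total)
```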
Extended categories due to changes in the domain distributions:
- Detective literature – this category also includes police / crime / action novels and short stories.
- Adventure literature – as there were no typical adventure texts in Bulgarian, we replaced these with fantasy novels and short stories (adventure fiction with fantastic elements) as well as psychological novels and short stories with adventure elements.
Description of corpus samples
General information
The description of each corpus sample includes general information about the text and the category to which it belongs.
- File name;
- File path;
- Old file name and path – in case the file was also included in the first version of the corpus;
- Author information – number of authors, names, unknown author;
- Text information – one or more texts, title;
- Form of the text – written, oral;
- Number of words;
- Date of adding the text to the corpus – source data are valid on that date;
- Date (year) of creating the text or its first publication;
- Date (year) of publication of the current version of the text;
- Information about the source;
- Additional notes.
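For readers who want to work with the descriptions programmatically, the fields listed above map naturally onto a simple record type. The Python sketch below is one possible representation only: the class name `SampleDescription` and all field names are hypothetical and do not reflect the column names in the distributed metadata file.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SampleDescription:
    """Hypothetical record for one corpus sample (field names are assumed)."""
    file_name: str
    file_path: str
    word_count: int
    date_added: str                       # source data are valid on this date
    form: str = "written"                 # "written" or "oral"
    old_file_path: Optional[str] = None   # set if the sample was in version 1
    authors: List[str] = field(default_factory=list)  # empty if unknown
    titles: List[str] = field(default_factory=list)   # one or more texts
    year_created: Optional[int] = None    # creation or first publication
    year_published: Optional[int] = None  # current version of the text
    source: Optional[str] = None
    notes: str = ""

    @property
    def single_author(self) -> bool:
        return len(self.authors) == 1

    @property
    def single_text(self) -> bool:
        return len(self.titles) == 1

    def is_short(self, min_words: int = 2000) -> bool:
        """True if the sample falls under the length criterion."""
        return self.word_count < min_words
```

A record built this way makes it straightforward to count the exceptions to the compilation criteria, for instance samples with several authors or with fewer than 2000 words.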
Full description
The full description of the Bulgarian Brown Corpus (in Bulgarian) can be downloaded as an MS Excel file.
Copyright of corpus samples
The law allows for the free use of texts under the conditions stated in paragraph 24 of the Copyright Law (last changes published in issue 77 / 2002).
Without the consent of the copyright holder and without payment of remuneration, the following is permissible:
- temporary reproduction of works if it is of a transitory or incidental nature, has no independent meaning, constitutes an indivisible and essential part of the technical process and is made for the sole purpose of allowing:
- intermediate network transmission, or
- other permitted use of work;
- the use of quotations from already published works of other persons for the purposes of reviews and overviews when indicating the source and the name of the author, unless this is impossible; the citation must conform to common practice and be to the extent justified by the purpose;
- the use of parts of published works or of a small number of works in other works in a volume necessary for analysis, commentary or other type of scientific research; such use is permitted only for scientific and educational purposes with reference to the source and the name of the author, unless this is impossible…
Copyright Law (issue 56 / 29/06/1993; issue 63 / 1994; issue 10 / 1998; issue 28 / 2000; issue 107 / 2000; issue 77 / 9/08/2002)
Neither the corpus as a whole nor individual corpus samples will be republished. Only the corpus description and information extraction programs are publicly published and available for use.
The description of the corpus and the programs for processing and extracting data from it are distributed free of charge, not for commercial use, but only for research and educational purposes.
Copyright on the Bulgarian Brown Corpus and its description
Copyright on collections, anthologies, bibliographies and databases (issue 28 / 2000)
Paragraph 11.
(1) The copyright on collections, anthologies, bibliographies, databases and the like belongs to the person who has carried out the selection or arrangement of the included works and/or materials, unless otherwise stipulated in a contract.
The copyright on the individual parts included in such a work, which have the character of works of literature, art and science, belongs to their authors.
(2) The inclusion of works or parts of them in such a work requires the consent of their authors, unless the law provides otherwise.
The resource provides search capabilities for linguistic research, educational and other purposes.
Parts of the Bulgarian Brown Corpus were used in the creation of BulSemCor and BulPosCor.
The study of the Bulgarian Brown Corpus poses questions and provides an environment for the theoretical and practical study of various problems that are generally underrepresented in scientific research. One example is the assessment of how adequate the applied model is for various modern purposes, given that it was created at Brown University in 1962-1963 mainly from observations of printed American publications rather than from statistical analyses. Studying this problem raises a number of scientific tasks, such as determining the extent to which the text selection criteria apply to texts in Bulgarian, and the extent to which printed and electronic texts fit into the same categories.
An interesting task is also the assessment of the relevance of the model for 2005 (the year of creation of the Bulgarian Brown Corpus, version 2). To this day, the question of how applicable statistical methods (based on quantitative analysis) are to creating a methodology for building corpora remains open.
When using the Bulgarian Brown Corpus in your research, please cite any of the following publications: