Bulgarian Brown Corpus « Department of Computational Linguistics

The Bulgarian Brown Corpus is a general static representative sample corpus of Bulgarian compiled at the Department of Computational Linguistics at the Institute for Bulgarian Language. It follows the methodology presented by Brown University, Providence, Rhode Island, USA and applied in the compilation of the famous Brown Corpus (Brown University Standard Corpus of Present-Day American English). It illustrates the linguistic usage of informative or fictional text types divided into categories according to stylistic, thematic and/or genre principles. In order to ensure good representativeness, balance and illustrativeness of the Bulgarian Brown Corpus, we rely on a preliminary structural model and taxonomy of text categories for which we sample appropriate texts.

Language: Bulgarian

Type: general representative monolingual sample text corpus

Composition: The Bulgarian Brown Corpus includes 500 texts distributed across 15 domains in 2 text categories – fiction and non-fiction. Each text is approximately 2000 words long; the exact number of words varies because sentence boundaries were preserved. The total volume of the corpus is 1 001 286 words. Corpus samples are excerpts of texts published in the period 1990-2005, predominantly after 2000.

History: The first version of the corpus was compiled in 2001-2002. It became apparent that some domains were not well covered, so some of the original Brown Corpus principles were abandoned (e.g., some texts were not originals but translations into Bulgarian, and some texts predate 1990). The experience gathered while compiling the first version, together with the significant increase in electronic publications in Bulgarian, led to the compilation of the second version in the period 2002-2005.

Annotation: The corpus is documented, normalised and edited.

Terms of use:

Free access to search online.
Free download under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Download:

➥ Bulgarian Brown Corpus: texts and metadata | ➥ Metadata ONLY (.xlsx format)

PROJECTS

BulNet – a Lexical-Semantic Network of Bulgarian (nationally funded project: 2005–2007; 2008–2010)
Bulgarian National Corpus (funded within EU framework and national programmes: 2010 – 2013)
Electronic Language Resources and Processing Tools (BulNet and FrameNet) (nationally funded project: 2011 – 2013)

PARTICIPANTS

The following team has taken part in the compilation of the Bulgarian Brown Corpus:

Prof. Dr. Svetla Koeva (head of the project)
Assist. Prof. Dr. Svetlozara Leseva, Dr. Ivelina Stoyanova, Assoc. Prof. Dr. Ekaterina Tarpomanova, Borislav Rizov, Nikola Obreshkov (compilation)

➥ Specific features of the Bulgarian Brown Corpus

➥ Main criteria of corpus compilation

➥ Classification

➥ Description of corpus samples

Specific features of the Bulgarian Brown Corpus

Representativeness is ensured using a random selection of texts distributed in homogeneous groups.
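
One way to read "random selection of texts distributed in homogeneous groups" is stratified random sampling: candidate texts are grouped by category, and a fixed number is drawn at random from each group. The sketch below is only an illustration of that idea, not the procedure the compilers actually used. The candidate pools, the function name and the seed are hypothetical; the two quotas (6 and 9) are the sample counts given for these categories in the classification table.

```python
import random


def stratified_sample(candidates, quotas, seed=0):
    """Draw quotas[group] texts at random from each group of candidates."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the draw reproducible
    selection = {}
    for group, quota in quotas.items():
        pool = candidates[group]
        if len(pool) < quota:
            raise ValueError(
                f"group {group!r} has only {len(pool)} candidates, need {quota}"
            )
        # sample() draws without replacement, so no text is picked twice
        selection[group] = rng.sample(pool, quota)
    return selection


# Hypothetical candidate pools; real pools would hold text identifiers.
candidates = {
    "M. Science fiction": [f"sf_text_{i}" for i in range(40)],
    "R. Humour literature": [f"humour_text_{i}" for i in range(25)],
}
# Quotas taken from the classification table.
quotas = {"M. Science fiction": 6, "R. Humour literature": 9}

selection = stratified_sample(candidates, quotas)
for group, texts in selection.items():
    print(group, len(texts))
```

Drawing within each group separately keeps the prescribed number of samples per category, which a single random draw over all candidates would not guarantee.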

Each corpus sample in the Bulgarian Brown Corpus represents an excerpt from a text with a length of approximately 2000 words. The term ‘corpus sample’ distinguishes the part of a text included in the corpus from the whole text. The Bulgarian corpus, following the model of the original Brown Corpus, consists of 500 corpus samples with a total volume of 1,001,286 words. Although the target length is approximately 2000 words, 136 samples are shorter because of their genre.

Unlike the original Brown Corpus (Brown University Standard Corpus of Present-Day American English), which is built from texts published within one calendar year (1961) in order to reflect a relatively static state of the language, the Bulgarian Brown Corpus includes texts created or first published over a relatively long period, 1990 to 2005, with most of the texts published after 2000. There are two reasons for this. First, many of the texts were collected in electronic form from the internet, where the date (year) of publication cannot be precisely determined. Second, Bulgarian sources did not sufficiently cover the categories of the original Brown Corpus, so the categories could not be filled with texts published within a shorter period.

Main criteria of corpus compilation (in priority order)

1. Texts need to be original, not translations.
2. Texts need to be recent – published after 1990, preferably after 2000.
3. Samples need to adhere to the categories and subcategories of the original Brown Corpus, with their prescribed number of corpus samples.
Exceptions:
- Subcategories in category F are redistributed.
- Within categories A-C the division between daily and weekly editions has been disregarded.
4. The text source needs to be available (valid on the date when the sample is added to the corpus).
Exceptions: 20 corpus samples have no information about the source.
5. The corpus sample should also be included in the first version of the corpus.
Exceptions: 385 corpus samples were replaced in order to satisfy the first four criteria.
6. Each corpus sample needs to be written by a single author.
Exceptions: 46 corpus samples are authored by more than one person; 70 have no known author.
7. Each corpus sample needs to be excerpted from a single text.
Exceptions: 104 corpus samples include excerpts from more than one text (belonging to short genres).
8. Each sample needs to contain at least 2000 words (ending at the first end of a sentence after the 2000th word).
Exceptions: 136 samples are shorter than 2000 words:
- 57 samples contain 1990-1999 words;
- 69 samples contain 1900-1989 words;
- 10 samples contain 1900 words or fewer.
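
The length rule (at least 2000 words, ending at the first end of a sentence after the 2000th word) can be sketched as follows. This is a minimal illustration under stated assumptions, not the compilers' tool: words are whitespace-separated tokens, a sentence ends at a token whose last character is `.`, `!`, `?` or `…` (optionally followed by a closing quote or bracket), and a sentence end on the 2000th word itself counts. Abbreviations such as "e.g." would be misread as sentence ends. The name `cut_sample` and the example text are hypothetical.

```python
import re

# Assumption: a token ends a sentence if it finishes with . ! ? or …,
# optionally followed by a closing quote or bracket.
SENTENCE_END = re.compile(r"[.!?…][»”)\]]*$")


def cut_sample(text, min_words=2000):
    """Keep at least min_words words, stopping at the next sentence end."""
    words = text.split()
    if len(words) <= min_words:
        return " ".join(words)  # text is too short to cut
    # Scan from the min_words-th word onwards for the first sentence end.
    for i in range(min_words - 1, len(words)):
        if SENTENCE_END.search(words[i]):
            return " ".join(words[: i + 1])
    return " ".join(words)  # no sentence end found: keep the whole text


# Tiny demonstration with a 5-word minimum instead of 2000.
text = "One two three four. Five six seven eight nine ten eleven. Twelve thirteen."
sample = cut_sample(text, min_words=5)
print(sample)
print(len(sample.split()))  # 11
```

Because the cut waits for a sentence boundary, sample lengths vary slightly above the minimum, which is why the samples are described as approximately 2000 words long.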

Classification

The classification is based on the following features:

Type of text – informative or fiction;
Category (based on the text type, the domain and/or the genre);
Subcategory (based on the category, the sample length and the source);
Genre (it has no classificatory function, only a descriptive one).

Table. Classification of the Bulgarian Brown Corpus.

Category	Subcategory	Number of samples
I. Informative texts
A. Massmedia: News	Politics	14
	Sports	7
	Society	3
	News	9
	Economics	4
	Culture	7
	total	44
B. Massmedia: Editorials and analyses	Institutional	10
	Personal	10
	Letters	7
	total	27
C. Massmedia: Reviews	Reviews	17
	total	17
D. Religion	Books	7
	Massmedia	6
	Short stories	4
	total	17
E. Leisure	Books	2
	Massmedia	34
	total	36
F. Popular	Books	10
	Massmedia	38
	total	48
G. Documentaries	Books	38
	Massmedia	37
	total	75
H. Administrative documents	Government	24
	Organisations	2
	Industrial reports	2
	Education	1
	Industrial periodicals	1
	total	30
J. Science	Natural Sciences	12
	Medicine	5
	Mathematics	4
	Social Sciences	14
	Politology, Law, Education	15
	Humanities	18
	Technology	12
	total	80
Total of informative texts		374
II. Fiction
K. Classical literature	Novels	20
	Short stories	9
	total	29
L. Detective literature	Novels	20
	Short stories	4
	total	24
M. Science fiction	Novels	3
	Short stories	3
	total	6
N. Adventure literature	Novels	15
	Short stories	14
	total	29
P. Romance literature	Novels	14
	Short stories	15
	total	29
R. Humour literature	Novels	3
	Essays, etc.	6
	total	9
Total of fiction texts		126
TOTAL		500
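
The subcategory counts in the table add up to the stated totals of 374 informative samples, 126 fiction samples and 500 samples overall. The short check below recomputes these sums; the counts are copied from the table, while the variable and function names are illustrative.

```python
# Subcategory counts copied from the classification table, keyed by category letter.
informative = {
    "A": [14, 7, 3, 9, 4, 7],
    "B": [10, 10, 7],
    "C": [17],
    "D": [7, 6, 4],
    "E": [2, 34],
    "F": [10, 38],
    "G": [38, 37],
    "H": [24, 2, 2, 1, 1],
    "J": [12, 5, 4, 14, 15, 18, 12],
}
fiction = {
    "K": [20, 9],
    "L": [20, 4],
    "M": [3, 3],
    "N": [15, 14],
    "P": [14, 15],
    "R": [3, 6],
}


def category_totals(groups):
    """Sum the subcategory counts of every category."""
    return {category: sum(counts) for category, counts in groups.items()}


informative_totals = category_totals(informative)
fiction_totals = category_totals(fiction)
n_informative = sum(informative_totals.values())
n_fiction = sum(fiction_totals.values())

print(informative_totals)
print(fiction_totals)
print(n_informative, n_fiction, n_informative + n_fiction)  # 374 126 500
```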

Extended categories due to changes in the domain distributions:

Detective literature – this category also includes police / crime / action novels and short stories.
Adventure literature – as there were no typical adventure texts in Bulgarian, we replaced these with fantasy novels and short stories (adventure fiction with fantastic elements) as well as psychological novels and short stories with adventure elements.

Description of corpus samples

General information

The description of each corpus sample includes general information about the text and the category to which it belongs.

File name;
File path;
Old file name and path – in case the file was also included in the first version of the corpus;
Author information – number of authors, names, unknown author;
Text information – one or more texts, title;
Form of the text – written, oral;
Number of words;
Date of adding the text to the corpus – source data are valid on that date;
Date (year) of creating the text or its first publication;
Date (year) of publication of the current version of the text;
Information about the source;
Additional notes.
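
As a rough illustration, the fields listed above could be modelled as one record per corpus sample. The sketch below is an assumption about how such a record might look in code; it does not describe the layout of the downloadable metadata file. All field names, the class name and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SampleDescription:
    """One corpus sample description; all field names are hypothetical."""
    file_name: str
    file_path: str
    form: str                                          # "written" or "oral"
    word_count: int
    date_added: str                                    # source data are valid on this date
    authors: List[str] = field(default_factory=list)   # empty list: unknown author
    titles: List[str] = field(default_factory=list)    # several entries: several texts
    old_file_path: Optional[str] = None                # set only if the sample was in version 1
    year_created: Optional[int] = None                 # creation or first publication
    year_published: Optional[int] = None               # publication of the current version
    source: Optional[str] = None
    notes: str = ""

    def has_single_author(self) -> bool:
        return len(self.authors) == 1

    def is_single_text(self) -> bool:
        return len(self.titles) == 1

    def is_below_target(self, target: int = 2000) -> bool:
        return self.word_count < target


# Invented example values, for illustration only.
sample = SampleDescription(
    file_name="A01.txt",
    file_path="A/A01.txt",
    form="written",
    word_count=1995,
    date_added="2005-01-01",
    authors=["Example Author"],
    titles=["Example title"],
)
print(sample.has_single_author(), sample.is_single_text(), sample.is_below_target())
```

The three helper methods mirror the compilation criteria on single authorship, single source text and minimum length, so exceptions to those criteria can be counted directly from the records.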

Full description

The full description of the Bulgarian Brown Corpus (in Bulgarian) can be downloaded as an MS Excel file.

Copyright of corpus samples

The law allows for the free use of texts under the conditions stated in paragraph 24 of the Copyright Law (last changes published in issue 77 / 2002).

Without the consent of the copyright holder and without payment of remuneration, the following is permissible:

temporary reproduction of works if it is of a transitory or incidental nature, has no independent meaning, constitutes an indivisible and essential part of the technical process and is made for the sole purpose of allowing:
- intermediate network transmission, or
- another permitted use of the work;
the use of quotations from already published works of other persons for the purposes of reviews and overviews when indicating the source and the name of the author, unless this is impossible; the citation must conform to common practice and be to the extent justified by the purpose;
the use of parts of published works or of a small number of works in other works in a volume necessary for analysis, commentary or other type of scientific research; such use is permitted only for scientific and educational purposes with reference to the source and the name of the author, unless this is impossible…

Copyright Law (issue 56 / 29/06/1993; issue 63 / 1994; issue 10 / 1998; issue 28 / 2000; issue 107 / 2000; issue 77 / 9/08/2002)

Neither the corpus as a whole nor individual corpus samples will be republished. Only the corpus description and the information extraction programs are published and available for use.

The description of the corpus and the programs for processing and extracting data from it are distributed free of charge for research and educational purposes only, not for commercial use.

Copyright on the Bulgarian Brown Corpus and its description

Paragraph 11.

(1) The copyright on collections, anthologies, bibliographies, databases and the like belongs to the person who has carried out the selection or arrangement of the included works and/or materials, unless otherwise stipulated in a contract.

The copyright on the individual parts included in such a work, which have the character of works of literature, art and science, belongs to their authors.

(2) The inclusion of works or parts of them in such a work requires the consent of their authors, unless the law provides otherwise.

The resource provides search capabilities for linguistic research, educational and other purposes.

Parts of the Bulgarian Brown Corpus were used in the creation of BulSemCor and BulPosCor.

The Bulgarian Brown Corpus raises questions, and provides an environment for the theoretical and practical study of problems, that are generally underrepresented in scientific research. One example is the assessment of how adequate the applied model is for various modern purposes: the model was created at Brown University in 1962-1963, mainly on the basis of observations of printed American publications rather than statistical analyses. Studying this problem poses a number of scientific tasks, such as determining the extent to which the text selection criteria apply to texts in Bulgarian and the extent to which printed and electronic texts fit into the same categories.

An interesting task is also the assessment of the relevance of the model for 2005 (the year of creation of the Bulgarian Brown Corpus, version 2). To this day, the question of how applicable statistical methods (based on quantitative analysis) are to creating a methodology for building corpora remains open.

When using the Bulgarian Brown Corpus in your research, please cite any of the following publications:


Koeva, S., D. Blagoeva (eds.). Ezikovi resursi i tehnologii za balgarski ezik. Sofia: BAS Academy Press, 2014, 310 p. ISBN: 978-954-322-797-6.

@BOOK{2014-Ezikovi-resursi,
  editor = {Св. Коева and Д. Благоева},
  title = {{Езикови ресурси и технологии за български език}},
  year = 2014,
  pages = {310},
  publisher = {{София: Академично издателство „Проф. Марин Дринов“}},
  ISBN = {{978-954-322-797-6}},
}

Ivelina Stoyanova, Svetla Koeva, Svetlozara Lesseva. Applying and analysing Brown corpus model for Bulgarian. Presentation at The Third Inter-Varietal Applied Corpus Studies (IVACS) group International Conference on “LANGUAGE AT THE INTERFACE” 23rd – 24th June 2006, Nottingham, UK.

@MISC{2006-Applying-and-analysing-Brown,
  author = {Ivelina Stoyanova and Svetla Koeva and Svetlozara Lesseva},
  title = {{Applying and analysing Brown corpus model for Bulgarian (Presentation)}},
  year = 2006,
  venue = {{The Third Inter-Varietal Applied Corpus Studies (IVACS) group International Conference on “LANGUAGE AT THE INTERFACE” 23rd – 24th June 2006, Nottingham, UK}},
}

Koeva, S., S. Leseva, I. Stoyanova, E. Tarpomanova, M. Todorova. Bulgarian Tagged Corpora. – In: Proceedings of the Fifth International Conference Formal Approaches to South Slavic and Balkan Languages, 2006, pp. 78–86.

@INPROCEEDINGS{2006-Bulgarian-Tagged-Corpora,
  author = {S. Koeva and S. Leseva and I. Stoyanova and E. Tarpomanova and M. Todorova},
  title = {{Bulgarian Tagged Corpora}},
  year = 2006,
  pages = {78--86},
  booktitle = {{Proceedings of the Fifth International Conference Formal Approaches to South Slavic and Balkan Languages}},
}

Koeva, S., S. Leseva, M. Todorova. Bulgarian Sense Tagged Corpus. – In: Proceedings of the 5th SALTMIL Workshop on Minority Languages: Strategies for Developing Machine Translation for Minority Languages, 2006, pp. 79–87.

@INPROCEEDINGS{2006-Bulgarian-Sense-Tagged-Co,
  author = {S. Koeva and S. Leseva and M. Todorova},
  title = {{Bulgarian Sense Tagged Corpus}},
  year = 2006,
  pages = {79--87},
  booktitle = {{Proceedings of the 5th SALTMIL Workshop on Minority Languages: Strategies for Developing Machine Translation for Minority Languages}},
}