Content

Features of the Bulgarian corpus.

Each corpus sample (corpus unit, text sample) is an excerpt or a set of excerpts from one or more texts. Its length is fixed at about 2 000 words; the precise number of words varies because the adopted methodology keeps sentence boundaries intact. The term 'corpus sample' and its synonyms refer to the part of a text (or texts) that is included in the corpus. The "Brown" Corpus of Bulgarian consists of 500 corpus samples and totals 1 001 286 words. Despite the intention to make every sample at least 2 000 words long, 136 samples contain fewer than 2 000 words.
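
The length rule described above (collect whole sentences until the sample reaches 2 000 words) can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the corpus: the function names, the regular-expression sentence splitter, and the whitespace-based word count are all assumptions made for illustration.

```python
import re

TARGET_WORDS = 2000  # nominal sample length stated in the text


def split_sentences(text):
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Real corpus tools handle abbreviations etc.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cut_sample(text, target=TARGET_WORDS):
    """Keep whole sentences from the start of `text` until at least
    `target` whitespace-separated words have been collected."""
    kept, count = [], 0
    for sentence in split_sentences(text):
        kept.append(sentence)
        count += len(sentence.split())
        if count >= target:
            break  # stop at the first sentence boundary past the target
    return " ".join(kept), count
```

Under these assumptions a sample is usually slightly longer than the target, and a source text shorter than the target yields a sample below it, which is one way samples under 2 000 words could arise.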

The Brown Corpus of Standard American English consists of one million words sampled from texts published in 1961 and represents a relatively homogeneous state of the language. The samples in the Bulgarian corpus were first published in the period 1990-2005, most of them after 2000. This longer time span was necessary because all the Brown Corpus categories could not otherwise be covered: the corpus units were taken from texts published on the Internet, and some of them had no specified year of publication.


Methodology for preparing the corpus (in order of priority):

  1. Originality of the samples (i.e. no translations).
  2. Recency of creation: the corpus excerpts were taken from texts created and published after 1990 and, if possible, after 2000.
  3. The classification into categories and subcategories, and the number of texts within each category, should match those of the Brown Corpus of Standard American English.
    Exceptions:
  4. Availability of the source of the text (in this case a web address valid on the date on which the text was included in the corpus)
    Exceptions: 20 corpus samples with no specified source.
  5. Presence of the corpus sample in the first version of the "Brown" Corpus of Bulgarian.
    Exceptions: 385 corpus samples that violated conditions 1-4 were replaced with new ones.
  6. The corpus sample should be an excerpt from a text or texts written by a single author or one team of authors.
    Exceptions: 92 corpus samples composed of more than one text from more than one author (or team of authors); 70 corpus samples with no author specified.
  7. The corpus sample should be an excerpt from a single text.
    Exceptions: 104 corpus samples composed of more than one text, including 88 texts from more than one author. All of them are from periodicals and represent short genres such as news items.
  8. Length of 2 000+ words (ending at the first sentence boundary after 2 000 words).
    Exceptions: 136 samples with fewer than 2 000 words, including: