Download

➥ Corpora for Download

The following precompiled subcorpora are available for download:

  • Administrative corpus of official EU documents – parallel, in 23 languages, with the largest corpora in English, German, Romanian, Polish and Greek.
  • Journalistic corpus from SETimes.com – parallel, in 9 Balkan languages (Bulgarian, Romanian, Macedonian, Serbian, Albanian, Greek, Turkish, Croatian, Bosnian) and English.
  • Popular science corpus from Wikipedia – in Bulgarian.
  • Administrative/Science corpus with medical texts from the EMEA – parallel, in 23 languages.

At present, the corpora are provided in plain text format; an annotated version may be made available upon request. For more details, please contact us: bulnc@dcl.bas.bg.

Requests for subcorpora extraction

Requests can cover texts that may be distributed in compliance with copyright law. The main corpora offered for download are listed above.

Requests can be based on one or more of the following criteria:

  • Style, domain and/or genre;
  • Period of time;
  • Language(s);
  • Size of the text samples and of the corpus, etc.

For more details, please contact us: bulnc@dcl.bas.bg.