Wiki1000+ corpus with annotated MWEs

General description: Wiki1000+ is a corpus of Wikipedia articles compiled for the study of multiword expressions (MWEs) in Bulgarian. It contains 6,311 text samples and 13.4 million tokens. The corpus is part of the Bulgarian National Corpus. Compilation: The corpus is collected automatically by a web crawler that crawls all pages in the Bulgarian…

The Bulgarian-English Sentence- and Clause-Aligned Corpus

General description: The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English Parallel Corpus, which is part of the Bulgarian National Corpus (BulNC) and contains approximately 260.7 million tokens for Bulgarian and 263.1 million tokens for English. The BulEnAC consists of 176,397 tokens for Bulgarian and 190,468 tokens for English (366,865 tokens altogether). The BulEnAC comprises 14,667 Bulgarian sentences…

Bulgarian-X language Parallel Corpus

The Bulgarian-X language Parallel Corpus (Bul-X-Cor) is part of the Bulgarian National Corpus (BulNC). The Bulgarian National Corpus is designed as a uniform framework for texts that differ in modality (written – spoken), period (synchronic – diachronic), and number of languages (monolingual – parallel, where one of the counterparts is Bulgarian). Any X-language in the corpus is treated equally with…

Bulgarian National Corpus

The Bulgarian National Corpus was created at the Institute for Bulgarian Language "Prof. L. Andreychin" by research associates from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. It incorporates several individual electronic corpora developed between 2001 and 2009 for the purposes of the two departments. The corpus is continually expanded with new texts. The Bulgarian…

Copyright © 2015-2022 Department of computational linguistics. All rights reserved.