Language Resources

Wiki1000+ corpus with annotated MWEs

General description: Wiki1000+ is a corpus of Wikipedia articles compiled for the study of multiword expressions (MWEs) in Bulgarian. Wiki1000+ contains 6311 text samples and 13.4 million tokens. The corpus is part of the Bulgarian National Corpus. Compilation: The corpus is collected automatically by a web crawler that crawls all pages in the Bulgarian…

The Bulgarian-English Sentence- and Clause-Aligned Corpus

General description: The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English Parallel Corpus, a part of the Bulgarian National Corpus (BulNC) of approximately 260.7 million tokens for Bulgarian and 263.1 million tokens for English. The BulEnAC consists of 176,397 tokens for Bulgarian and 190,468 tokens for English (366,865 tokens altogether). The BulEnAC comprises 14,667 Bulgarian sentences…

N-grams on Bulgarian National Corpus

The BgNgrams lists are extracted from the current version of the Bulgarian National Corpus (with a core Bulgarian part containing over 1.2 billion words). The n-grams cover both lemmas (n-gram lemma) and word forms (n-gram word form), and range from 1-grams to 5-grams. The n-gram language models (orders 1 to 5) are provided in the standard ARPA text format and in binary format.
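As a rough illustration of what lists of this kind contain, the sketch below counts n-grams of orders 1 to 5 over two parallel token sequences, one of word forms and one of lemmas. It is not the procedure used to build BgNgrams: the function name `ngram_counts` and the toy English sentence are assumptions made for this example only.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of order 1..max_n in a token sequence.

    Hypothetical helper for illustration; not part of the BgNgrams tooling.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the tokens.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy input (an assumption): the same sentence as word forms and as lemmas.
word_forms = ["the", "cats", "sat", "on", "the", "mats"]
lemmas = ["the", "cat", "sit", "on", "the", "mat"]

form_counts = ngram_counts(word_forms)   # n-gram word form
lemma_counts = ngram_counts(lemmas)      # n-gram lemma
```

Keeping one counter per order mirrors the way the lists are described above, with separate 1-gram to 5-gram inventories for lemmas and for word forms.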

Frequency Dictionaries

General overview: The Frequency Dictionaries are derived from the Bulgarian National Corpus (BulNC), the largest systematically compiled, representative corpus of Bulgarian. They record how frequently lexical items occur in the corpus (BulNC version: December 2011). The BulNC samples are classified by style, domain and genre. Texts are divided into…

Multilingual Dictionaries

The set of multilingual dictionaries covers all pairs of languages among the following: Bulgarian, English, German, Romanian, Greek, and Polish. The main source of the dictionaries is Wikipedia – translations of article titles and category labels. The dictionaries include single words, MWEs and phrases but are predominantly phrase-to-phrase. The following sets of dictionaries are included in the pack: • General…

Corpus-Extracted MWE Lists

We adopt the classification of multiword expressions (MWEs) developed by Baldwin et al. (Baldwin, T., C. Bannard, T. Tanaka, D. Widdows. An Empirical Model of Multiword Expression Decomposability. In: Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. 2003), who distinguish between non-decomposable, idiosyncratically decomposable and simple decomposable MWEs. Further, we divide simple decomposable MWEs into categories…

Bulgarian-X language Parallel Corpus

The Bulgarian-X language Parallel Corpus (Bul-X-Cor) is a part of the Bulgarian National Corpus (BulNC). The Bulgarian National Corpus is designed as a uniform framework for texts of different modality (written – spoken), period (synchronic – diachronic), and number of languages (monolingual – parallel where one of the counterparts is Bulgarian). Any X-language in the corpus is equally treated with…

Bulgarian National Corpus

The Bulgarian National Corpus was created at the Institute for Bulgarian Language "Prof. L. Andreychin" by research associates from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. It incorporates several individual electronic corpora developed between 2001 and 2009 for the purposes of the two departments. The corpus is constantly enlarged with new texts. The Bulgarian…

Multiword Expression Dictionary for Bulgarian

The Bulgarian dictionary of MWEs includes 27,744 MWEs in total, divided into 13 categories based on their idiomaticity, which is evaluated with respect to the following features: • whether the MWE is a named entity; • whether the MWE contains a reference to a named entity; • the degree to which the meaning of the MWE is compositional and transparent. The…

Copyright © 2015-2022 Department of computational linguistics. All rights reserved.