N-grams on Bulgarian National Corpus

The BgNgrams lists are extracted from the current version of the Bulgarian National Corpus (whose core Bulgarian part contains over 1.2 billion words). The n-grams are provided for both lemmas (n-gram lemma) and word forms (n-gram word form), in orders from 1-grams to 5-grams. The n-gram language models (orders 1-5) are available in the standard ARPA text format and in binary format.
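As an illustration of what the ARPA text format looks like in practice, the following minimal Python sketch reads an ARPA-style model into nested dictionaries. It assumes only the generic ARPA layout (a `\data\` header, `\N-grams:` sections with tab-separated log10 probability, n-gram and optional back-off weight, and a closing `\end\`). The function name `parse_arpa` and the sample entries with their values are invented for the example and are not taken from the BgNgrams files.

```python
from collections import defaultdict


def parse_arpa(lines):
    """Parse ARPA-format text lines.

    Returns {order: {ngram_tuple: (log10_prob, backoff_or_None)}}.
    """
    models = defaultdict(dict)
    order = None  # order of the section currently being read
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and the header block with the n-gram counts.
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        # Section marker such as "\2-grams:".
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional (absent for highest-order n-grams).
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        models[order][words] = (logprob, backoff)
    return dict(models)


# Hypothetical sample; the probabilities are made up for illustration.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=1

\\1-grams:
-0.5\t<s>\t-0.3
-0.7\tдобър\t-0.2
-0.9\tден

\\2-grams:
-0.4\tдобър ден

\\end\\
""".splitlines()

model = parse_arpa(SAMPLE)
```

For a real model file, the same function could be given an open file handle (opened with `encoding="utf-8"`) instead of the sample lines. The binary models would need the toolkit that produced them, which this sketch does not cover.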

Frequency Dictionaries

General overview: The Frequency Dictionaries are derived from the Bulgarian National Corpus (BulNC), the largest systematically created and representative corpus of Bulgarian. They reflect the frequency of occurrence of lexical items in the corpus (BulNC version: December 2011). The BulNC samples are classified by style, domain and genre. Texts are divided into…

Multilingual Dictionaries

The set of multilingual dictionaries covers all pairs of languages among the following: Bulgarian, English, German, Romanian, Greek, and Polish. The main source of the dictionaries is Wikipedia – translations of article titles and category labels. The dictionaries include single words, MWEs and phrases but are predominantly phrase-to-phrase. The following sets of dictionaries are included in the pack: • General…

Corpus-Extracted MWE Lists

We adopt the classification of multiword expressions (MWEs) developed by Baldwin et al. (Baldwin, T., C. Bannard, T. Tanaka, D. Widdows. An Empirical Model of Multiword Expression Decomposability. In: Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. 2003), who distinguish between non-decomposable, idiosyncratically decomposable and simple decomposable MWEs. Further, we divide simple decomposable MWEs into categories…

Multiword Expression Dictionary for Bulgarian

The Bulgarian dictionary of MWEs includes 27,744 MWEs in total, divided into 13 categories based on their idiomaticity, which is evaluated with respect to the following features: • whether the MWE is a named entity; • whether the MWE contains a reference to a named entity; • the degree to which the meaning of the MWE is compositional and transparent. The…

Copyright © 2015-2022 Department of computational linguistics. All rights reserved.