Georg Rehm
Prof. Dr. Georg Rehm works as a Principal Researcher in the Speech and Language Technology Lab at the German Research Center for Artificial Intelligence (DFKI), in Berlin. Currently, Georg Rehm is the Coordinator of QURATOR (BMBF, 2018-2022) and European Language Grid (ELG; EU, 2019-2022). Furthermore, he is the Co-coordinator of European Language Equality (ELE; EU, 2021-2022) and involved, as a…
Registration (ELG / MIC 21 Workshops)
➥ Information about the event Thank you for your interest in the First Bulgarian dissemination event of the European Language Grid (ELG). The event will be streamed live on the YouTube channel of the Department of Computational Linguistics. Participants can use the chat for comments and questions, which will be addressed during the Q&A sessions. Please take part in our…
Wiki1000+ corpus with annotated MWEs
General description Wiki1000+ is a corpus of Wikipedia articles compiled for the study of multiword expressions (MWEs) in Bulgarian. Wiki1000+ contains 6311 text samples and 13.4 million tokens. The corpus is part of the Bulgarian National Corpus. Compilation The corpus is collected automatically via a web crawler which crawls all pages in the Bulgarian…
N-grams on Bulgarian National Corpus
The BgNgrams lists are extracted from the current version of the Bulgarian National Corpus (whose core Bulgarian part contains over 1.2 billion words). The n-grams cover both lemmas (n-gram lemma) and word forms (n-gram word form) and range in length from 1-grams to 5-grams. The n-gram language models (orders 1-5) are available in the standard ARPA text format and in a binary format.
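As a rough illustration of what lists of 1-grams to 5-grams contain, the following minimal sketch counts contiguous n-grams over a token sequence. It is not the procedure used to build BgNgrams: the function names `extract_ngrams` and `count_ngrams` and the sample tokens are hypothetical, and the same counting would apply to lemmas if a lemmatized token list were passed in instead of word forms.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tokens, max_n=5):
    """Map each order n (1..max_n) to a Counter of n-gram frequencies."""
    return {n: Counter(extract_ngrams(tokens, n)) for n in range(1, max_n + 1)}


# Hypothetical sample input; a real run would read tokenized corpus text.
word_forms = ["the", "cat", "saw", "the", "cat"]
counts = count_ngrams(word_forms)

# Print the most frequent n-gram of each order with its count.
for n, counter in counts.items():
    ngram, freq = counter.most_common(1)[0]
    print(n, " ".join(ngram), freq)
```

Keeping one counter per order mirrors the way the lists are split by n-gram length; sequences shorter than n simply yield no n-grams of that order.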
Frequency Dictionaries
General overview The Frequency Dictionaries are derived from the Bulgarian National Corpus (BulNC), which is the largest systematically created and representative corpus of Bulgarian. The Frequency Dictionaries reflect the frequency of occurrence of lexical items in the corpus (BulNC version: December 2011). The classification of the BulNC samples is based on their style, domain and genre. Texts are divided into…
Multilingual Dictionaries
The set of multilingual dictionaries covers all pairs of languages among the following: Bulgarian, English, German, Romanian, Greek, and Polish. The main source of the dictionaries is Wikipedia – translations of article titles and category labels. The dictionaries include single words, MWEs and phrases but are predominantly phrase-to-phrase. The following sets of dictionaries are included in the pack: • General…