The Bulgarian Sense-Annotated Corpus (BulSemCor) is a structured corpus of texts in Bulgarian in which all words are assigned an appropriate sense from the Bulgarian WordNet. BulSemCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.
Language: Bulgarian.
Type: general monolingual text corpus enriched with linguistic annotation.
Composition: 500 texts of 100+ words, divided into 15 categories of 2 types – fiction and informative texts; size of the source corpus – 101,062 tokens, size of the annotated corpus – 99,480 lexical units.
Annotation: tokenisation and sentence splitting; POS-tagging and lemmatisation; word-sense disambiguation (each lexical unit is assigned the most appropriate sense from the Bulgarian WordNet in the relevant context) carried out by expert linguists.
Access:
- Free access for online search.
- Free download under a Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).
BulSemCor is part of the semantically annotated corpora of the Global WordNet.
PROJECTS
- BulNet – a Lexical-Semantic Network of Bulgarian (state-funded project: 2005 – 2007; 2008 – 2010)
- Electronic Language Resources and Processing Tools (BulNet and FrameNet) (state-funded project: 2011 – 2013)
PARTICIPANTS
The following people participated in the development of the Bulgarian Sense-Annotated Corpus:
Prof. Dr. Svetla Koeva (head of the project)
Assist. Prof. Dr. Tsvetana Dimitrova, Assist. Prof. Dr. Hristina Kukova, Assist. Prof. Dr. Svetlozara Leseva, Assist. Prof. Dr. Maria Todorova, Assoc. Prof. Dr. Ekaterina Tarpomanova (annotators)
Katya Alahverdzieva, Nikolay Radanov (part-time annotators)
Borislav Rizov (developer of the annotation programme)
Nikola Obreshkov (compilation of the source corpus)
General description
The Bulgarian Sense-Annotated Corpus (BulSemCor) is manually annotated according to the sense inventory of the Bulgarian WordNet. Its size is comparable to that of many of the existing semantically annotated corpora created for other languages. The semantic annotation consists in associating each lexical item in the corpus with exactly one synonym set (synset) in the Bulgarian WordNet. The selection of the best matching sense (the one that best describes the sense used in the particular context) among the suggested candidates is guided by a set of procedures that take into account the other synset members, the synset gloss (explanatory definition) and the place of the candidate synset in the WordNet structure, among others.
The number of annotated tokens is 99,480; the difference from the size of the original unannotated corpus (101,062 tokens) is due to the fact that some of the tokens are not linguistic items. Of the annotated tokens, 86,842 are simple words, and the remaining 12,638 tokens belong to 5,797 multiword expressions (MWEs).
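The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this description; the constant and function names are illustrative and not part of any BulSemCor tooling.

```python
# Counts quoted in the corpus description.
SOURCE_TOKENS = 101_062     # tokens in the unannotated source corpus
ANNOTATED_TOKENS = 99_480   # tokens that received a sense annotation
SIMPLE_WORDS = 86_842       # annotated single-word tokens
MWE_COUNT = 5_797           # annotated multiword expressions
MWE_TOKENS = 12_638         # tokens that belong to those expressions


def corpus_summary() -> dict:
    """Derive a few consistency figures from the published counts."""
    return {
        # Tokens left out of the annotation (not linguistic items).
        "non_linguistic_tokens": SOURCE_TOKENS - ANNOTATED_TOKENS,
        # Simple words plus MWE tokens should equal the annotated total.
        "tokens_accounted_for": SIMPLE_WORDS + MWE_TOKENS,
        # Average number of tokens per multiword expression.
        "mean_mwe_length": MWE_TOKENS / MWE_COUNT,
    }


if __name__ == "__main__":
    for name, value in corpus_summary().items():
        print(name, value)
```

The simple words and the MWE tokens add up to the annotated total, and the multiword expressions average a little over two tokens each.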
The Bulgarian Sense-Annotated Corpus was developed within the state-funded project BulNet – A Lexical Semantic Network for Bulgarian (2005 – 2010). The annotation follows the methodology adopted in the creation of the SemCor Corpus (Miller 1995: Miller, G. A. Building Semantic Concordances: Disambiguation vs. Annotation. – In: AAAI Technical Report SS-95-01, 1995, pp. 92 – 94.) in combination with a number of specific principles described in Koeva (2010). The corpus for annotation was excerpted from the Bulgarian Brown Corpus, which is itself modelled on the Brown Corpus (Francis and Kucera 1979: Francis, N., H. Kucera. Manual of Information to Accompany a Standard Sample of Present-day Edited American English, for Use with Digital Computers. Department of Linguistics, Brown University, Providence, R. I., U.S.A., original ed. 1964, revised 1971, revised and augmented 1979). An important feature of BulSemCor is that the sample texts were selected using heuristic methods aimed at providing optimal coverage of ambiguous lexis.
The semantic annotation was performed using Chooser, a linguistic annotation tool developed specially for this purpose.
Characteristics of the texts included in the corpus
The representativeness of the Bulgarian Sense-Annotated Corpus is ensured by the fact that it inherits the structure of the Bulgarian Brown Corpus: a sample of at least 100 words (expanded left and right to the beginning and the end of the relevant sentence) was excerpted from each of the 500 texts in the Bulgarian Brown Corpus. Each sample was selected according to the highest concentration of content words in a frequency dictionary compiled from two grammatically disambiguated corpora: the Bulgarian translation of 1984 by George Orwell and a corpus of texts from three thematic areas – economics, law and politics. In order to achieve an optimally balanced selection of words across parts of speech, different weights were assigned to nouns (0.4), verbs (0.3), adjectives (0.2) and adverbs (0.1).
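The description does not give the exact selection formula, so the following is only a minimal sketch of one way such a heuristic could work. It assumes that a window's score is the sum of the part-of-speech weights of its tokens whose lemmas occur in the frequency dictionary, and that the highest-scoring 100-token window is kept. The tag names (N, V, A, R), the function names and the sliding-window strategy are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Weights quoted in the description; the tag labels themselves are hypothetical.
POS_WEIGHTS = {"N": 0.4, "V": 0.3, "A": 0.2, "R": 0.1}

Token = Tuple[str, str]  # (lemma, part-of-speech tag)


def window_score(tokens: List[Token], freq_lemmas: Set[str]) -> float:
    """Sum the POS weights of tokens whose lemma is in the frequency dictionary."""
    return sum(
        POS_WEIGHTS.get(pos, 0.0)
        for lemma, pos in tokens
        if lemma in freq_lemmas
    )


def best_window(tokens: List[Token], freq_lemmas: Set[str],
                size: int = 100) -> Tuple[int, float]:
    """Return (start index, score) of the highest-scoring window of `size` tokens."""
    if len(tokens) <= size:
        return 0, window_score(tokens, freq_lemmas)
    best_start, best_score = 0, -1.0
    for start in range(len(tokens) - size + 1):
        score = window_score(tokens[start:start + size], freq_lemmas)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


if __name__ == "__main__":
    toy_text = [("a", "N"), ("b", "X"), ("c", "V"), ("d", "N"), ("e", "N")]
    print(best_window(toy_text, {"a", "c", "d", "e"}, size=2))
```

In the actual corpus the selected window was additionally expanded to full sentence boundaries, a step this sketch leaves out.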
Annotation coverage
Two of the most important features of BulSemCor are the uniform approach to the different lexical units and the principle of consistent and comprehensive annotation.
All lexical items, regardless of their structure (single words or multiword expressions) or function (content or closed-class words), are treated on a par and are annotated according to adopted general criteria.
The sense inventory for the annotation of the Bulgarian Sense-Annotated Corpus is provided by the Bulgarian WordNet, BulNet, which was chosen for a number of reasons: the granularity and comprehensiveness of the senses defined in it; its complex relational structure employed in various applications related to natural language processing; the mapping to the Princeton WordNet (and hence – to other wordnets), which ensures access to the corresponding senses in a large number of languages; the extensible annotation schema allowing new senses to be encoded or edited in parallel with the annotation process.
Corpus format
The annotated files are stored in XML format. A word sense, represented by the word form which appears in the context, w (word), and its citation form, l (lemma), is uniquely determined by the value of the attribute s (sense). All the components of a multiword expression are assigned the same value for the attribute p (parent), for example:
<word l="финансово" p="-1529022516" s="107274521200" w="финансовото"/>
<word l="министерство" p="-1529022516" s="107274521200" w="министерство"/>
Sentence end is encoded by the attribute e. There are also two system attributes: u (user) and t (annotation timestamp).
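As a rough illustration of how the attributes described above might be read, the sketch below parses the two example elements with Python's standard xml.etree.ElementTree module and groups the components of a multiword expression by their shared p value. The enclosing <text> element and the function names are assumptions made for the example; the actual root element and file layout of the distributed corpus may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two <word> elements come from the description; the <text> wrapper
# is a hypothetical root added only so that the fragment parses.
SAMPLE = """<text>
<word l="финансово" p="-1529022516" s="107274521200" w="финансовото"/>
<word l="министерство" p="-1529022516" s="107274521200" w="министерство"/>
</text>"""


def group_by_parent(xml_text: str) -> dict:
    """Group <word> elements that share the same p (parent) value."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for word in root.iter("word"):
        parent = word.get("p")
        if parent is not None:  # elements without p are skipped here
            groups[parent].append(word)
    return dict(groups)


def describe(words: list) -> dict:
    """Join the components of one expression into form, lemma and sense."""
    return {
        "form": " ".join(w.get("w") for w in words),
        "lemma": " ".join(w.get("l") for w in words),
        "sense": words[0].get("s"),  # components share one sense value
    }


if __name__ == "__main__":
    for parent_id, words in group_by_parent(SAMPLE).items():
        print(parent_id, describe(words))
```

Grouping on p rather than on s keeps two separate occurrences of the same expression apart, on the assumption that each occurrence receives its own parent value.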
Annotated units inherit all the linguistic information associated with the corresponding synset:
- its part of speech,
- explanatory definition,
- usage examples,
- notes describing grammatical, semantic and pragmatic restrictions regarding one or more members of the synset or the synset as a whole,
- the set of semantic, morpho-semantic and extra-linguistic relations linking the relevant synset to other synsets in WordNet,
- the set of semantic and derivational relations associated with a given literal (synset member).
Parts of the Bulgarian Sense-Annotated Corpus have been used as a training and test corpus in the development of a probabilistic formalism and a word-sense disambiguation programme for the purposes of machine translation (Rizov 2009, abstract).
The study of BulSemCor poses questions and provides a medium for theoretical and applied research into a variety of poorly studied linguistic issues, such as the polysemy of closed-class words and multiword expressions. Semantic annotation and the expansion of BulNet with synsets based on the senses attested in BulSemCor pose a number of research questions related to the automatic recognition of linguistic units and their lexicographic description.
The annotated corpus provides a point of departure for the development of models for semantic analysis. For instance, the information about the semantic class of the annotated predicates and their arguments and adjuncts (the values of the semantic classes are inherited from WordNet along with the relations expressed by prepositions and conjunctions and the ontological type of the adverbials) enables the study and formalisation of the semantic relations among the participants in a given situation and facilitates the definition of cognitively valid selectional restrictions.
BulSemCor is free to use under the CC BY-SA 4.0 license. When using BulSemCor in your research, please cite any of the following publications: