The project aims to further develop the Bulgarian National Corpus (BulNC) by expanding its contents and improving its representativeness, balance and accessibility for linguistic research and lexicographic work on the vocabulary of the Bulgarian language.
To expand BulNC further (including the parallel multilingual corpora that are part of it), the team works on web search and on the automatic identification and collection of relevant documents. An important direction in improving BulNC is the construction of a taxonomic classification model for organising the documents, which allows new types of texts to be collected and classified and the corpus to be restructured easily. The automatic linguistic annotation of BulNC is an ongoing task. For lexicographic research, the project employs a methodology for selecting corpus samples to be used in the search engine for different lexicographic tasks (http://search.dcl.bas.bg). The project also focuses on improving the opportunities for extracting specialised monolingual and multilingual corpora.
The research topics, applicability, and scientific and social impact of the project meet the priorities set in the National Strategy of Scientific Research to 2020 and in the EU Framework Programme for Research and Innovation Horizon 2020. The work on expanding BulNC and improving its accessibility is interdisciplinary and combines methods from linguistics, computational linguistics, lexicography and corpus linguistics, among others. The resources and applications support the development of socially oriented software solutions and technologies: summarisation of large documents in administration, media and libraries; automatic search for relevant documents in any given area; translation aids; a system of linguistic rules for developing applications that support people with hearing disabilities; and others. The project is part of two priority areas of research at the Institute for Bulgarian Language: Theoretical Linguistics and Electronic Language Resources and Tools for Their Processing.
Type of project: collective, long-term
Funding: budgetary (BAS); a bilateral project with the Czech Academy of Sciences; a project funded by the Bulgarian National Science Fund
Principal Investigators: Prof. Svetla Koeva, Ph.D., Assoc. Prof. Sia Kolkovska, Ph.D.
Participants: Prof. Svetla Koeva, Prof. Sia Kolkovska, Prof. D. Blagoeva, Assist. Prof. T. Dimitrova, Assist. Prof. S. Leseva, Ivelina Stoyanova, B. Rizov, Assist. Prof. M. Todorova, L. Dzhakov, M. Yalamov, Assist. Prof. T. Georgieva, Assist. Prof. N. Kostova, Assist. Prof. A. Atanasova
➢ Bulgarian National Corpus. Collective project. Participants: Prof. Svetla Koeva, Prof. Sia Kolkovska, Prof. D. Blagoeva, Assist. Prof. T. Dimitrova, Assist. Prof. S. Leseva, I. Stoyanova, B. Rizov, Assist. Prof. M. Todorova, L. Dzhakov, M. Yalamov, Assist. Prof. T. Georgieva, Assist. Prof. N. Kostova, Assist. Prof. A. Atanasova. Period: 2014-2016
➢ Automatic identification of Named Entities in Bulgarian and Czech. Collective bilateral project with the Institute for Czech Language at the Czech Academy of Sciences. Principal investigator: Assist. Prof. T. Dimitrova, Ph.D. Participants: Prof. S. Koeva, Assist. Prof. T. Dimitrova. Period: 2014-2016
Presentation of the results: annotated corpus of Bulgarian, annotated multilingual parallel corpus, search engine, scientific publications.