The Bulgarian National Corpus « Department of Computational Linguistics

Period: 2017 – 2019

Type of project: collective, long-term

Funding: budgetary (BAS)

Principal Investigator: Prof. Svetla Koeva

Participants: Prof. Svetla Koeva, Assist. Prof. T. Dimitrova, Assist. Prof. S. Leseva, Assist. Prof. M. Todorova, Ivelina Stoyanova, B. Rizov, L. Dzhakov, M. Yalamov.

Abstract:

The project aims to further develop the Bulgarian National Corpus (BulNC) by expanding its contents and improving its representativeness, balance and accessibility for linguistic research and lexicographic work on the vocabulary of the Bulgarian language. To expand BulNC further (including the parallel multilingual corpora that are part of the corpus), the project relies on automatic identification and collection of relevant documents. An important direction in improving BulNC is the construction of a taxonomic classification model for organising the documents, which allows new types of texts to be collected and classified and the corpus to be restructured easily. The automatic linguistic annotation of BulNC is an ongoing task. For the purposes of lexicographic research, the project employs a methodology for selecting corpus samples to be used in the search engine for different lexicographic tasks. The project also focuses on improving the opportunities for extracting specialised monolingual and multilingual corpora.

The work on expanding BulNC and improving its accessibility is interdisciplinary and combines methods from linguistics, computational linguistics, lexicography, corpus linguistics and other fields. The resources and applications support the development of socially oriented software solutions and technologies for summarisation of large documents in administration, media and libraries; automatic search for relevant documents in any given area; translation aids; a system of linguistic rules for developing applications that support people with hearing disabilities; etc. The project is part of two priority areas of research at the Institute for Bulgarian Language – Theoretical Linguistics and Electronic Language Resources and Tools for Language Processing.

Presentation of the results: improved search engine for the Bulgarian National Corpus; dictionary of multiword expressions; improved system for neologism detection; improved system for quotation extraction from media content; research papers presenting theoretical research and practical implications of the compilation of the Bulgarian National Corpus and its applications.

The Bulgarian National Corpus

Bulgarian WordNet

Multilingual Image Corpus

Bulgarian National Corpus

Dictionary of Bulgarian Language, online implementation by DCL

META-SHARE – network of repositories of language data, tools and related web services

System for business intelligence, language resources provided by DCL.