Metadata

The metadata description of the texts in the BulNC is organised into 27 categories, which comply with established standards but are defined to meet the particular needs of the BulNC.

filename	path_to_file	date_added_to_corpus
author_info	author	translator_info
translator	text_info	title
year_of_creation	publishing_date	source_type
source	translated	medium
number_of_words	style	genre
genre_info	domain1	domain2
domain_info	notes	keywords
languages	quality	accessibility

Metadata categories in the BulNC description scheme.
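As a minimal sketch, the 27 categories in the table above could be held as a fixed tuple of field names against which a metadata record is checked. The helper `unknown_keys` and the sample values are hypothetical and serve only as illustration; they are not part of the BulNC tooling.

```python
# The 27 category names, in the order they appear in the table above.
CATEGORIES = (
    "filename", "path_to_file", "date_added_to_corpus",
    "author_info", "author", "translator_info",
    "translator", "text_info", "title",
    "year_of_creation", "publishing_date", "source_type",
    "source", "translated", "medium",
    "number_of_words", "style", "genre",
    "genre_info", "domain1", "domain2",
    "domain_info", "notes", "keywords",
    "languages", "quality", "accessibility",
)

# Categories carrying the _info label hold optional extra details.
OPTIONAL = tuple(c for c in CATEGORIES if c.endswith("_info"))


def unknown_keys(record):
    """Return the sorted keys of `record` that are not known categories."""
    return sorted(set(record) - set(CATEGORIES))


# Hypothetical record; the values are invented for illustration only.
sample = {"filename": "text_0001.txt", "style": "fiction", "genre": "novel"}
unknown = unknown_keys(sample)
```

A check of this kind catches misspelled field names early, before a record is added to the corpus description.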

The metadata scheme can be represented as a graph in which the nodes are associated with metadata categories and the arcs with binary relations between the nodes, such as style, domain, and genre. For some metadata relations, for instance style, the metadata categories are predefined; for others, such as author, the categories form an open set. The representation is simplified: for example, the authorship of a text is recorded only once and is shared by the Bulgarian sample and the samples in other languages. As a further advantage, the graph representation allows flexible extension with new relations and categories and shows where merging or splitting categories is permissible.

Example of graph representation of corpus metadata
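One possible reading of the graph representation can be sketched as follows, assuming that texts and category values are nodes and that each relation name labels a set of arcs. The class `MetadataGraph`, the identifiers, and the values in `PREDEFINED` are hypothetical placeholders, not the actual BulNC inventory or implementation.

```python
from collections import defaultdict

# Relations whose categories are predefined (a closed set). The values
# listed here are placeholders, not the actual BulNC inventory.
PREDEFINED = {"style": {"fiction", "journalism", "science"}}


class MetadataGraph:
    """Texts and category values are nodes; arcs are named binary relations."""

    def __init__(self):
        # arcs[relation][text_id] -> set of category nodes
        self.arcs = defaultdict(lambda: defaultdict(set))

    def add(self, text_id, relation, category):
        """Add an arc, rejecting values outside a predefined category set."""
        allowed = PREDEFINED.get(relation)
        if allowed is not None and category not in allowed:
            raise ValueError(f"{category!r} is not a predefined {relation} category")
        self.arcs[relation][text_id].add(category)

    def categories(self, text_id, relation):
        """Return the categories linked to a text by the given relation."""
        # Read with .get so that lookups do not create empty entries.
        return set(self.arcs.get(relation, {}).get(text_id, set()))


graph = MetadataGraph()
graph.add("text_0001", "style", "fiction")      # closed set: value is checked
graph.add("text_0001", "author", "Author A")    # open set: any value accepted
graph.add("text_0001", "publisher", "Press B")  # new relation, no schema change
```

Because relations are plain dictionary keys in this sketch, a new relation is introduced simply by adding its first arc, which mirrors the flexible extension described above.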

The metadata are as detailed as possible in order to support text classification, corpus restructuring and evaluation, and the derivation of subcorpora based on a set of criteria (e.g., year of publication, domain). Some of the metadata categories, those labelled with _info, are optional and contain additional details about the main category. A multiple-domain description (domain1, domain2) is also included for texts with mixed domain features. So far, extensive metadata are provided for the Bulgarian and English parts of the BulNC, while the corresponding texts in the other languages share the common metadata (author, title, etc.) and inherit the classificatory information for style, domain, and genre.
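The derivation of subcorpora and the inheritance of classificatory information could be sketched as below. This is an illustration under stated assumptions: records are plain dictionaries keyed by category name, publishing_date is stored as an integer year, and the exact set of inherited fields in `INHERITED` is a guess based on the description above. All function names and sample values are hypothetical.

```python
def derive_subcorpus(records, **criteria):
    """Return the records whose metadata satisfy every criterion.

    A criterion is either a plain value (equality test) or a callable
    that receives the stored value and returns True or False.
    """
    def matches(record):
        for key, wanted in criteria.items():
            value = record.get(key)
            if callable(wanted):
                if value is None or not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [r for r in records if matches(r)]


def in_domain(record, domain):
    """True if the domain appears in either of the two domain categories."""
    return domain in (record.get("domain1"), record.get("domain2"))


# Assumed set of shared and inherited fields (author, title, style, genre, domain).
INHERITED = ("author", "title", "style", "genre", "domain1", "domain2")


def inherit(parent, child):
    """Copy shared fields from `parent` into `child` where the child lacks them."""
    merged = dict(child)
    for key in INHERITED:
        if key in parent and key not in merged:
            merged[key] = parent[key]
    return merged


# Invented sample records, for illustration only.
records = [
    {"filename": "a.txt", "publishing_date": 1995, "domain1": "law"},
    {"filename": "b.txt", "publishing_date": 2008, "domain1": "economy", "domain2": "law"},
]
recent = derive_subcorpus(records, publishing_date=lambda year: year >= 2000)
law = [r for r in records if in_domain(r, "law")]
parallel = inherit({"author": "Author A", "style": "fiction"}, {"filename": "a_de.txt"})
```

Checking both domain1 and domain2 means that a text with mixed domain features is selected under either of its domains.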