The metadata description of the texts in the BulNC is organised into 27 categories that comply with the established standards while being tailored to the particular needs of the BulNC.


filename           path_to_file      date_added_to_corpus
author_info        author            translator_info
translator         text_info         title
year_of_creation   publishing_date   source_type
source             translated        medium
number_of_words    style             genre
genre_info         domain1           domain2
domain_info        notes             keywords
languages          quality           accessibility

Metadata categories in the BulNC description scheme.
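
The category list above can also be written down as a checklist against which a metadata record is validated. The following is a minimal sketch only: it treats the categories whose names end in _info as optional and, as an assumption made purely for illustration, expects every other category to be present. The helper names missing_categories and unknown_keys and the dictionary-based record format are hypothetical and are not part of the BulNC tooling.

```python
# The 27 category names, in the order given in the table above.
CATEGORIES = (
    "filename", "path_to_file", "date_added_to_corpus",
    "author_info", "author", "translator_info",
    "translator", "text_info", "title",
    "year_of_creation", "publishing_date", "source_type",
    "source", "translated", "medium",
    "number_of_words", "style", "genre",
    "genre_info", "domain1", "domain2",
    "domain_info", "notes", "keywords",
    "languages", "quality", "accessibility",
)

# Categories labelled with _info carry optional additional detail.
OPTIONAL = tuple(c for c in CATEGORIES if c.endswith("_info"))


def missing_categories(record):
    """Return the non-optional categories that are absent from a record (a dict).

    Assumption: every category without the _info suffix is expected to be present.
    """
    return [c for c in CATEGORIES if c not in OPTIONAL and c not in record]


def unknown_keys(record):
    """Return, in sorted order, the record keys that are not categories of the scheme."""
    return sorted(k for k in record if k not in CATEGORIES)
```

A record that passes both checks uses only names from the scheme and omits nothing except, possibly, the optional _info details.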

The metadata scheme can be represented as a graph in which the nodes are associated with metadata categories and the arcs with binary relations between the nodes, such as style, domain, and genre. For some metadata relations, for instance style, the metadata categories are predefined; for others, such as author, the categories form an open set. The representation is simplified: for example, the authorship of a text is recorded only once for the Bulgarian sample and the corresponding samples in other languages. As a further advantage, the graph representation allows flexible extension with new relations and categories and shows where merging or splitting categories is permissible.

Example of graph representation of corpus metadata
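
A graph of this kind can be sketched in a few lines, assuming a simple encoding in which every arc is stored as a (source node, relation, target node) triple. Everything specific in the example is hypothetical: the node identifiers (work_001, text_bg_001, text_en_001), the sample_of relation used to link language samples to a shared node, the style values and the class name MetadataGraph. It illustrates the idea and does not describe the actual storage format of the BulNC.

```python
from collections import defaultdict

# Relations with a predefined (closed) set of categories.
# The values listed here are illustrative, not the BulNC inventory.
CLOSED_SETS = {
    "style": {"fiction", "journalism", "science"},
}


class MetadataGraph:
    def __init__(self):
        # (source node, relation) -> set of target nodes
        self.arcs = defaultdict(set)

    def add(self, source, relation, target):
        """Add one arc; closed relations accept only predefined categories."""
        allowed = CLOSED_SETS.get(relation)
        if allowed is not None and target not in allowed:
            raise ValueError(f"{target!r} is not a predefined {relation} category")
        self.arcs[(source, relation)].add(target)

    def get(self, source, relation):
        """Targets attached directly to a node by the given relation."""
        return set(self.arcs.get((source, relation), set()))

    def lookup(self, text, relation):
        """Value recorded on the text itself, else on the node it is a sample of."""
        own = self.get(text, relation)
        if own:
            return own
        inherited = set()
        for work in self.get(text, "sample_of"):
            inherited |= self.get(work, relation)
        return inherited


g = MetadataGraph()
g.add("work_001", "author", "Author A")        # open set: any value is accepted
g.add("work_001", "style", "fiction")          # closed set: must be predefined
g.add("text_bg_001", "sample_of", "work_001")  # Bulgarian sample
g.add("text_en_001", "sample_of", "work_001")  # English sample
g.add("work_001", "series", "Series X")        # a new relation needs no schema change

print(g.lookup("text_en_001", "author"))       # the author is stored once, on work_001
```

Because authorship is attached to the shared node, both language samples resolve to the same value, and a new relation such as the hypothetical series is added without touching the existing arcs.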

The metadata are as detailed as possible in order to support easy text classification, corpus restructuring and evaluation, and the derivation of subcorpora based on a set of criteria (e.g., year of publication, domain). Some of the metadata categories, those labelled with the suffix _info, are optional and contain additional details about the main category. A multiple-domain description (domain1, domain2) is also included to accommodate texts with mixed domain features. So far, extensive metadata have been provided for the Bulgarian and English parts of the BulNC, while the corresponding texts in the other languages share the common metadata (author, title, etc.) and inherit the classificatory information for style, domain, and genre.
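
Deriving a subcorpus from a set of criteria can be pictured as a filter over metadata records. The sketch below assumes that each record is a plain dictionary keyed by the category names of the scheme and that publishing_date holds an integer year; the function name derive_subcorpus, the sample records and their values are hypothetical. A text satisfies a domain criterion when either domain1 or domain2 carries the requested value, which mirrors the multiple-domain description.

```python
def derive_subcorpus(records, year_range=None, domain=None, **exact):
    """Return the records that satisfy all of the given criteria.

    year_range: inclusive (first, last) pair tested against publishing_date,
                assumed here to be stored as an integer year.
    domain:     matched against either domain1 or domain2.
    exact:      any other category, compared for equality (e.g. style="science").
    """
    selected = []
    for rec in records:
        if year_range is not None:
            year = rec.get("publishing_date")
            if year is None or not (year_range[0] <= year <= year_range[1]):
                continue
        if domain is not None and domain not in (rec.get("domain1"), rec.get("domain2")):
            continue
        if any(rec.get(key) != value for key, value in exact.items()):
            continue
        selected.append(rec)
    return selected


# Hypothetical records, reduced to the categories used by the criteria.
records = [
    {"filename": "a.txt", "publishing_date": 1998, "style": "journalism",
     "domain1": "economy", "domain2": None},
    {"filename": "b.txt", "publishing_date": 2005, "style": "journalism",
     "domain1": "politics", "domain2": "economy"},
    {"filename": "c.txt", "publishing_date": 2007, "style": "science",
     "domain1": "economy", "domain2": None},
]

subcorpus = derive_subcorpus(records, year_range=(2000, 2010), domain="economy")
print([rec["filename"] for rec in subcorpus])  # ['b.txt', 'c.txt']
```

The text b.txt is selected through its second domain, which is the case that a single-domain description would miss.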