The structure of the corpus adheres to three main principles: explicit definition of categories, a clear-cut structure, and structural flexibility. The structure is not rigid in the sense that it is not predefined. The corpus samples are supplied with extensive metadata, which facilitates the extraction of subcorpora with a specific structure and features.
Language reflects communication in the following aspects: the function and roles of the participants (style), thematic content (domain), and compositional structure (genre). Recognising how these aspects interconnect is essential for building a good model of text description and classification. The design of the corpus is therefore based on the three basic classificatory features of style, domain, and genre.
1. Style. Style is defined as a general complex text category, which combines the notions of register, mode, and discourse. The following styles are included in BulNC: Administrative, Science, Massmedia, Fiction, Informal, Informal/Fiction (film subtitles), Popular science, Popular.
2. Domain. Each style is subdivided into thematic domains. Domains are generally style-dependent, although some are found across styles. For example, the scientific style is divided into categories according to scientific field, e.g., economy, political science, etc. Some of the domains of journalistic texts are similar to those of scientific texts – politics, economy, etc. Each text can be assigned to up to two domains, since a high percentage of texts were observed to belong to complex domains and interdisciplinary fields.
3. Genre. For our purposes we accept the interpretation where genre is associated with the internal formal features of the text, both written and spoken. A general classification of genres based on style is used in the BulNC.
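The three classificatory features and the metadata-driven extraction of subcorpora can be illustrated with a minimal sketch. It is not the BulNC metadata schema or tooling: the record type `TextSample`, its field names, the function `extract_subcorpus`, and the genre labels are hypothetical and chosen only for illustration. The style and domain values are taken from the description above, as is the limit of two domains per text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class TextSample:
    """Hypothetical metadata record (illustrative field names, not BulNC's schema)."""
    text_id: str
    style: str                 # e.g. "Science", "Massmedia", "Fiction"
    domains: Tuple[str, ...]   # thematic domains, at most two per text
    genre: str                 # placeholder genre label

    def __post_init__(self) -> None:
        # The description allows each text to be assigned to up to two domains.
        if len(self.domains) > 2:
            raise ValueError("a text can be assigned to at most two domains")


def extract_subcorpus(
    samples: Iterable[TextSample],
    style: Optional[str] = None,
    domain: Optional[str] = None,
    genre: Optional[str] = None,
) -> List[TextSample]:
    """Return the samples that match every criterion that was given."""
    return [
        s for s in samples
        if (style is None or s.style == style)
        and (domain is None or domain in s.domains)
        and (genre is None or s.genre == genre)
    ]


# Invented sample records; the genre values are placeholders.
samples = [
    TextSample("t1", "Science", ("economy",), "article"),
    TextSample("t2", "Massmedia", ("politics", "economy"), "news report"),
    TextSample("t3", "Fiction", (), "novel"),
]

# A domain can cut across styles, so filtering by domain alone
# may return texts from more than one style.
economy_texts = extract_subcorpus(samples, domain="economy")
print([s.text_id for s in economy_texts])
```

Because no fixed corpus structure is assumed, any combination of the three features can define a subcorpus. For instance, `extract_subcorpus(samples, style="Science", domain="economy")` narrows the selection to scientific texts on the economy.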
More details on the classification scheme: here