The Corpus search engine supports complex linguistic queries that combine different levels of annotation in various ways. It is designed to handle monolingual and parallel corpora in a uniform way. The syntax allows searching by word forms, grammatical tags and semantic relations, alone or in combination. The atomic formulae allow both ordered and unordered queries, and they can be combined with all Boolean operations: negation, disjunction, conjunction, implication and equivalence. Thanks to the alignment, the corresponding sentences in parallel documents are also accessible. The hits are paginated and the matches are highlighted. For any sentence in the hit set, the user can view detailed information: the sentence metadata, its context, and its correspondence(s) in the other languages.
The relation symbols are enclosed in slashes / /. At present, searches can be performed for synonyms /S/ (for nouns, verbs, adjectives and adverbs), hyperonyms /H/ (for nouns and verbs), and the relation similar to /L/ (for adjectives). The search хубав/S/ returns all synonyms of the word хубав in the Bulgarian wordnet, together with their forms, that are found in the Corpus: хубав, хубава, добър, добра and so on.
The search мусака/H/ returns all hyperonyms of the word мусака and their forms: ястие, блюдо, блюда and so on. The search велик/L/ returns all literals, together with their forms, that are found in the Bulgarian wordnet and in the Corpus and are connected to the word велик by the relation similar to: значим, значима, голям, големи, важен, важно and so on.
The symbol for word forms, /F/, is also enclosed in slashes, because the link between a main word and its word forms is treated as a type of grammatical relation. The search рисувам/F/ returns all synthetic forms of the word рисувам that are found in the Corpus. The word whose forms are searched for need not be given in its main form: рисуват/F/ also returns all synthetic forms of рисувам that are found in the Corpus.
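The behaviour of /F/ described above can be illustrated with a minimal sketch. The lexicon `TOY_FORMS` and the function `expand_forms` are hypothetical names introduced only for this illustration; they are not part of the search engine, and the word list is toy data rather than Corpus output.

```python
# Hypothetical toy lexicon: main form -> word forms (illustrative data only).
TOY_FORMS = {
    "рисувам": ["рисувам", "рисуваш", "рисува",
                "рисуваме", "рисувате", "рисуват"],
}

def expand_forms(term):
    """Expand a 'word/F/' term into all forms of the word's paradigm.

    The word may be given in any of its forms, not only the main form.
    """
    suffix = "/F/"
    if not term.endswith(suffix):
        raise ValueError(f"not a word-form term: {term!r}")
    word = term[:-len(suffix)]
    for forms in TOY_FORMS.values():
        if word in forms:
            return list(forms)
    return []  # the word is unknown to the toy lexicon

# Both spellings of the query lead to the same set of forms.
print(expand_forms("рисувам/F/") == expand_forms("рисуват/F/"))
```

A real implementation would additionally keep only those forms that actually occur in the Corpus.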
The symbols for grammatical characteristics are enclosed in curly brackets {}. A grammatical characteristic can be viewed as an attribute to which particular values are assigned; for example, the grammatical category number has the values singular, plural and countable form for Bulgarian. Searches therefore take the form attribute=value. The attributes and values that can be searched for at present, together with the symbols used to denote them, are listed below:
Part of speech POS: noun N, verb V, adjective A, adverb ADV, pronoun P, numeral NUM, preposition PREP, conjunction CONJ, particle PART, interjection I.
Noun gender G: masculine M, feminine F, neuter NE.
Noun type NT: common CO, personal PR.
Numeral type NUMT: cardinal C, ordinal O.
Verb aspect VA: perfective PE, imperfective IM.
Verb transitivity VT: transitive T, intransitive IN.
Pronoun type PT: personal L, possessive POSS.
Number N: singular s, plural pf, countable form cf.
Person P: first 1, second 2, third 3.
Gender FG: masculine mf, feminine ff, neuter nf.
Definiteness D: indefinite form 0, definite form df.
Tense T: Present r, Past Aorist e, Past Imperfect j.
Impersonal (non-finite) verb form IVF: present active y, past aorist x, past imperfect q, passive participle w, adverbial participle z.
The search син/F/{FG=ff} returns all feminine forms of the word син. The search син/F/{D=df} returns all definite forms of the word син.
The symbol * denotes any word characterized by a particular set of grammatical features. For example, the search *{POS=A} returns all adjectives.
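To make the attribute=value notation concrete, here is a minimal sketch of how a single search term could be parsed and checked against an annotated token. Everything in it is an assumption made for illustration: the names `parse_term` and `token_matches`, the token layout (`form`, `lemma`, `feats`), the restriction to one {attribute=value} group per term, and the requirement that the word before /F/ is given in its main form. The sample tokens are toy data.

```python
import re

# Assumed term shape: a word or *, an optional /R/ relation symbol,
# and an optional single {ATTRIBUTE=value} group.
TERM_RE = re.compile(
    r"^(?P<word>\*|[^/{}\s*]+)"
    r"(?:/(?P<rel>[A-Z])/)?"
    r"(?:\{(?P<attr>[A-Z]+)=(?P<val>[A-Za-z0-9]+)\})?$"
)

def parse_term(term):
    """Split a term such as 'син/F/{D=df}' into its named parts."""
    m = TERM_RE.match(term)
    if m is None:
        raise ValueError(f"unrecognised term: {term!r}")
    return m.groupdict()

def token_matches(token, term):
    """Check one annotated token against one search term."""
    q = parse_term(term)
    if q["word"] != "*":
        if q["rel"] == "F":
            # Simplification: the query word is assumed to be the main form.
            if token["lemma"] != q["word"]:
                return False
        elif q["rel"] is None:
            if token["form"] != q["word"]:
                return False
        else:
            raise NotImplementedError("only /F/ is handled in this sketch")
    if q["attr"] is not None:
        return token["feats"].get(q["attr"]) == q["val"]
    return True

# Toy annotated tokens (hypothetical annotation layout).
TOKENS = [
    {"form": "синя", "lemma": "син", "feats": {"POS": "A", "FG": "ff"}},
    {"form": "синият", "lemma": "син",
     "feats": {"POS": "A", "FG": "mf", "D": "df"}},
    {"form": "ден", "lemma": "ден", "feats": {"POS": "N"}},
]

for query in ("син/F/{FG=ff}", "син/F/{D=df}", "*{POS=A}"):
    print(query, [t["form"] for t in TOKENS if token_matches(t, query)])
```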
The ordered search is enclosed in angle brackets <>. For example, the search <хубав/F/{D=df} *{POS=N}> returns the definite forms of the adjective хубав in front of a noun: хубавите дами, хубавото птиче, хубавите мебели. The search <*{POS=A} и *{POS=A}> finds two adjectives connected by the coordinating conjunction и (and), for example: малкия и средния, културна и просветна and so on. The search <*{POS=A} *{POS=N} и *{POS=A} *{POS=N}> returns coordinated noun phrases: елитните университети и средните училища, различни похвати и различни теми and so on.
Arbitrary words in the ordered search are denoted by square brackets [], which contain two numbers: the minimum and the maximum number of such words. The search <на [1,2] ден> finds sequences consisting of the preposition на, at least one and at most two arbitrary words, and the word ден. The search <на [2,2] *{POS=N}> finds the preposition на, followed by exactly two arbitrary words and a noun.
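The matching of an ordered search with a bounded gap can be sketched as follows. This is a simplified illustration, not the engine's actual algorithm: the function `find_ordered` and the pattern representation (plain strings for words, a `(min, max)` tuple for a gap) are assumptions, and only literal word forms are compared.

```python
def find_ordered(tokens, pattern):
    """Return (start, end) spans where the pattern matches consecutively.

    tokens  -- list of word forms in one sentence
    pattern -- list of elements: a string matches that exact word,
               a (lo, hi) tuple matches between lo and hi arbitrary words
    """
    def match_from(i, p):
        # All pattern elements consumed: report the end of the match.
        if p == len(pattern):
            return i
        elem = pattern[p]
        if isinstance(elem, tuple):
            lo, hi = elem
            # Try every permitted gap length, shortest first.
            for k in range(lo, hi + 1):
                if i + k > len(tokens):
                    break
                end = match_from(i + k, p + 1)
                if end is not None:
                    return end
            return None
        if i < len(tokens) and tokens[i] == elem:
            return match_from(i + 1, p + 1)
        return None

    spans = []
    for start in range(len(tokens)):
        end = match_from(start, 0)
        if end is not None:
            spans.append((start, end))
    return spans

# The query <на [1,2] ден> in the assumed representation.
query = ["на", (1, 2), "ден"]
print(find_ordered("на другия ден".split(), query))
print(find_ordered("на ден".split(), query))
```

The first call finds one match covering the whole three-word sequence; the second finds none, because the gap requires at least one word between на and ден.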
Conjunction is denoted by &, disjunction by |, negation by !, implication by => and equivalence by <=>. Round brackets () are used for grouping. The search този&нов finds sentences in which both този and нов occur. The search този|нов finds sentences in which at least one of този and нов occurs. The search !български/F/&банка returns the sentences in which the word банка appears but the word български does not occur in any of its forms. The negated implication !(фигура=>шахмат/F/) finds all sentences in which фигура occurs but no form of шахмат does.
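The Boolean operations can be illustrated with a small evaluator over the set of words in one sentence. This is a sketch under stated assumptions: queries are written as nested tuples rather than parsed from the query syntax, atoms are plain word forms (the /F/ expansion is left out), and the function name `evaluate` is hypothetical. It shows in particular why a negated implication !(a=>b) selects sentences that contain a but not b, since a=>b is equivalent to !a|b.

```python
def evaluate(expr, words):
    """Evaluate a Boolean query against the set of words in a sentence.

    expr  -- a word (str) or a tuple (operator, operand, ...)
    words -- set of word forms occurring in the sentence
    """
    if isinstance(expr, str):
        return expr in words
    op, *args = expr
    vals = [evaluate(a, words) for a in args]
    if op == "!":
        return not vals[0]
    if op == "&":
        return vals[0] and vals[1]
    if op == "|":
        return vals[0] or vals[1]
    if op == "=>":
        return (not vals[0]) or vals[1]
    if op == "<=>":
        return vals[0] == vals[1]
    raise ValueError(f"unknown operator: {op!r}")

# !(фигура=>шахмат), with a plain word in place of шахмат/F/.
query = ("!", ("=>", "фигура", "шахмат"))
print(evaluate(query, {"фигура"}))             # True: only фигура occurs
print(evaluate(query, {"фигура", "шахмат"}))   # False: both occur
print(evaluate(query, {"шахмат"}))             # False: фигура is absent
```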