
The Bulgarian-English Sentence- and Clause-Aligned Corpus

General description

The Bulgarian-English Sentence- and Clause-Aligned Corpus (BulEnAC) is an excerpt from the Bulgarian-English Parallel Corpus, which is part of the Bulgarian National Corpus (BulNC) and contains approximately 260.7 million tokens for Bulgarian and 263.1 million tokens for English.
The BulEnAC consists of 176,397 tokens for Bulgarian and 190,468 for English (366,865 tokens altogether). The BulEnAC comprises 14,667 Bulgarian sentences (12.02 words per sentence on average) and 15,718 English sentences (12.11 words per sentence). The average number of clauses in a sentence in the Bulgarian part is 1.67 compared to 1.85 clauses per sentence for the English part.
The Bulgarian-English Parallel Corpus has been processed at several levels: tokenisation, sentence splitting, and lemmatisation. The processing was performed with the Bulgarian language processing chain for the Bulgarian part and with Apache OpenNLP and Stanford CoreNLP for the English part (Koeva et al. 2012a, Koeva et al. 2012b).

Compilation

The texts are distributed over five broad categories, called ‘styles’: Administrative texts (20.5%), Fiction (21.35%), Journalistic texts (37.13%), Science (11.16%) and Informal/Fiction, i.e. subtitles (9.84%).

Average length of Bulgarian and English sentences (in terms of number of clauses) across the different styles

The corpus is represented in XML format and is supplied with several layers of linguistic annotation: monolingual for both Bulgarian and English (sentence splitting, tokenisation, lemmatisation, POS and grammatical tagging), and parallel (sentence and clause alignment).
Each word is represented as an element of type word. Each word element is defined by a set of attributes that correspond to different annotation levels:
• lexical level (lemmatisation) — the attributes w and l denote the word form and the lemma, respectively;
• syntactic (sentence level) — the combination of two attributes, e=True and sen=senID, denotes the end of each sentence and the corresponding id of the sentence in the corpus;
• syntactic (clause level) — the attribute cl corresponds to the id of the clause in which the word occurs;
• syntactic (applied only to conjunctions) — the attribute cl2 is used for conjunctions and other words and phrases that connect two clauses, and denotes the id of the clause to which the current clause is connected. The attribute m defines the type of the relation between the two clauses cl and cl2 (coordination or subordination), the direction of the relation (in the case of subordination) and the position of the conjunction with respect to the clauses;
• alignment — the attributes sen_al and cl_al define sentence and clause alignment, respectively. Corresponding sentences/clauses in the two parallel texts are assigned the same id.
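The attribute scheme above can be read with a standard XML parser. The following is a minimal sketch in Python, not the corpus's official tooling: the element name word and the attributes w, l, e, sen and cl come from the description above, while the root element text, all ids, and the value of m are invented for illustration. The sketch assumes that the sentence-final word carries both e="True" and sen, and it groups word forms by clause within each sentence.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment. Attribute names follow the description above;
# the <text> root, the ids and the value of `m` are invented.
SAMPLE_XML = """
<text>
  <word w="She" l="she" cl="c1" sen_al="a1" cl_al="k1"/>
  <word w="said" l="say" cl="c1" sen_al="a1" cl_al="k1"/>
  <word w="that" l="that" cl="c2" cl2="c1" m="sub" sen_al="a1" cl_al="k2"/>
  <word w="he" l="he" cl="c2" sen_al="a1" cl_al="k2"/>
  <word w="left" l="leave" e="True" sen="s1" cl="c2" sen_al="a1" cl_al="k2"/>
</text>
"""


def read_sentences(xml_text):
    """Return {sentence id: {clause id: [word forms]}}."""
    root = ET.fromstring(xml_text)
    sentences = {}
    current = defaultdict(list)  # clause id -> word forms of the open sentence
    for word in root.iter("word"):
        current[word.get("cl")].append(word.get("w"))
        if word.get("e") == "True":  # sentence-final word closes the sentence
            sentences[word.get("sen")] = dict(current)
            current = defaultdict(list)
    return sentences


if __name__ == "__main__":
    for sen_id, clauses in read_sentences(SAMPLE_XML).items():
        for cl_id, forms in clauses.items():
            print(sen_id, cl_id, " ".join(forms))
```

Because the sentence is closed only when e="True" is seen, the same loop also works if sen happens to be present on every word of the sentence.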

Annotation

The manual sentence and clause alignment, as well as the verification and post-editing of the automatically performed alignment, were carried out with a specially designed tool, ClauseChooser.
The monolingual annotation mode includes:
• sentence splitting;
• clause splitting;
• correction of wrong splitting (merging of split sentences/clauses);
• annotation of conjunctions;
• identification of the type of relation between pairs of connected clauses.

ClauseChooser’s monolingual mode

The multilingual mode uses the output of the monolingual sentence and clause splitting and supports:
• manual sentence alignment;
• manual clause alignment.

ClauseChooser’s alignment mode

Sentence and clause alignment

Both the Bulgarian and the English parts of the corpus were automatically sentence-split and sentence-aligned. The sentence segmentation of the Bulgarian part was performed with the BG Sentence Splitter. The tool identifies the sentence boundaries in a raw Bulgarian text using regular rules and a lexicon (Koeva and Genov 2011). The English part was sentence-split using an implementation of an OpenNLP pre-trained model. Sentence alignment was carried out automatically using HunAlign, and manually verified by experts.

BG:EN alignment    Frequency    % of all
0:1                     1187        7.60
1:0                      225        1.44
1:1                    13697       87.74
1:2                      264        1.69
2:1                      187        1.20
other                     51        0.33

Sentence alignment categories.
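Since corresponding units share the same alignment id, n:m categories like those in the table can be tallied by counting how many units on each side carry each id. The sketch below illustrates the idea in Python. It assumes, hypothetically, that an unaligned unit is represented by None (the source does not say how 0:1 and 1:0 cases are encoded), and the function names and ids are invented for illustration.

```python
from collections import Counter


def alignment_categories(bg_ids, en_ids):
    """Count BG:EN alignment categories.

    bg_ids / en_ids hold one alignment id per sentence (or clause);
    None marks an unaligned unit (an assumption of this sketch).
    """
    bg = Counter(i for i in bg_ids if i is not None)
    en = Counter(i for i in en_ids if i is not None)
    cats = Counter()
    # An id seen on one side only yields 0 on the other side.
    for al_id in bg.keys() | en.keys():
        cats[f"{bg[al_id]}:{en[al_id]}"] += 1
    unaligned_bg = sum(1 for i in bg_ids if i is None)
    unaligned_en = sum(1 for i in en_ids if i is None)
    if unaligned_bg:
        cats["1:0"] += unaligned_bg
    if unaligned_en:
        cats["0:1"] += unaligned_en
    return cats


def category_shares(cats):
    """Express each category as a percentage of all alignments."""
    total = sum(cats.values())
    return {cat: round(100 * n / total, 2) for cat, n in cats.items()}


if __name__ == "__main__":
    cats = alignment_categories(["a1", "a2", "a2", None], ["a1", "a2", "a3"])
    print(dict(cats))
    print(category_shares(cats))
```

The same function applies unchanged to clause alignment if clause-level ids are passed in place of sentence-level ones.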

A pre-trained OpenNLP parser was used to determine the clause boundaries in the English part, followed by manual expert post-editing. The Bulgarian sentences were split into clauses manually. The task was performed in compliance with the specific syntactic rules and the established grammar tradition and annotation practices for the respective languages, thus ensuring the authenticity of the annotation decisions and outlining actual language-specific issues of multilingual alignment.
After the clause splitting had been carried out or verified, the clause-introducing conjunctions were identified, together with the type of relation they denote, the clauses involved in the relation, and the direction of the relation.
Finally, the parallel clauses occurring within corresponding sentence pairs were manually aligned.
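The conjunction annotation described above (attributes cl, cl2 and m on the connecting word) can be collected into a list of clause relations. The following Python sketch shows one way to do so. The sample fragment is hypothetical: the root element, the ids, and the value "sub" for m are invented, since the actual inventory of m values is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the attribute names are taken from the corpus description.
RELATION_SAMPLE = """
<text>
  <word w="She" l="she" cl="c1"/>
  <word w="said" l="say" cl="c1"/>
  <word w="that" l="that" cl="c2" cl2="c1" m="sub"/>
  <word w="he" l="he" cl="c2"/>
  <word w="left" l="leave" e="True" sen="s1" cl="c2"/>
</text>
"""


def clause_relations(xml_text):
    """Return (conjunction, clause id, connected clause id, relation) tuples."""
    root = ET.fromstring(xml_text)
    return [
        (word.get("w"), word.get("cl"), word.get("cl2"), word.get("m"))
        for word in root.iter("word")
        if word.get("cl2") is not None  # only connecting words carry cl2
    ]


if __name__ == "__main__":
    for conj, cl, cl2, rel in clause_relations(RELATION_SAMPLE):
        print(f"{conj!r}: clause {cl} -> clause {cl2} ({rel})")
```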

BG:EN alignment    Frequency    % of all
0:1                     1745        7.05
1:0                      482        1.95
1:1                    18997       76.80
1:2                     2256        9.12
1:3                      329        1.33
1:4                       99        0.40
2:1                      621        2.51
2:2                       78        0.32
other                    128        0.52

Clause alignment categories.

Applications

The NLP applications of the BulEnAC encompass at least three interrelated areas:
• developing methods for automatic clause splitting and alignment;
• developing methods for clause reordering to improve the training data for SMT (Koeva et al. 2012b);
• word and phrase alignment.

Related publications

Koeva et al. 2012a: Koeva, Svetla, Borislav Rizov, Ekaterina Tarpomanova, Tsvetana Dimitrova, Rositsa Dekova, Ivelina Stoyanova, Svetlozara Leseva, Hristina Kukova, and Angel Genov (2012a) “Application of Clause Alignment for Statistical Machine Translation”. In: Proceedings of SSST-6, Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation, Jeju, Republic of Korea, 12 July 2012, The Association for Computational Linguistics: ACL 2012 / SIGMT / SIGLEX Workshop, 2012, pp. 102-110. ISBN: 978-1-937284-38-1.

Koeva et al. 2012b: Koeva, Svetla, Borislav Rizov, Ekaterina Tarpomanova, Tsvetana Dimitrova, Rositsa Dekova, Ivelina Stoyanova, Svetlozara Leseva, Hristina Kukova, and Angel Genov (2012b) “Bulgarian-English Sentence- and Clause-Aligned Corpus”. In: Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), Lisbon, 29 November 2012. Lisboa: Colibri, 2012, pp. 51-62. ISBN: 978-989-689-273-9.

Koeva and Genov 2011: Koeva, Svetla, and Angel Genov (2011) “Bulgarian Language Processing Chain”. In: Proceedings of the Workshop on the Integration of Multilingual Resources and Tools in Web Applications in conjunction with GSCL 2011, 26 September 2011, Hamburg.

Copyright © 2015 Department of computational linguistics. All rights reserved.