
Wiki1000+ corpus with annotated MWEs

General description

Wiki1000+ is a corpus of Wikipedia articles compiled for the study of multiword expressions (MWEs) in Bulgarian. It contains 6,311 text samples and 13.4 million tokens. The corpus is part of the Bulgarian National Corpus.

Compilation

The corpus is collected automatically by a web crawler that visits all pages in the Bulgarian section of Wikipedia and stores all valid documents. At the same time, metadata are extracted and stored in the format of the Bulgarian National Corpus.

A flat XML format is preferred for representing the annotation because it is robust and easy to process. The Wikipedia pages were retrieved in the special format produced by Special:Export, which the wiki software generates upon request. The full Wikipedia corpus comprises 176,622 text files and 41 million words.
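The crawler itself is not described here, so the following is only a minimal sketch of how a request URL for Special:Export might be formed. The host name, the `EXPORT_BASE` constant and the `export_url` helper are assumptions made for illustration and are not part of the corpus tooling.

```python
from urllib.parse import quote

# Assumption: pages are requested from the Bulgarian Wikipedia host using the
# Special:Export/<title> URL pattern provided by MediaWiki.
EXPORT_BASE = "https://bg.wikipedia.org/wiki/Special:Export/"


def export_url(title: str) -> str:
    """Return a Special:Export URL for a page title (hypothetical helper)."""
    # MediaWiki page titles use underscores in place of spaces;
    # non-ASCII characters are percent-encoded as UTF-8.
    return EXPORT_BASE + quote(title.replace(" ", "_"))
```

Calling `export_url("Стара Загора")` yields a URL whose title part is percent-encoded and contains an underscore instead of the space. Fetching and storing the XML is left out of the sketch.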

Wiki1000+ includes only texts with more than 1,000 tokens, which reduces processing time.

Format of the data and annotation

Example:

<word w="Редом" l="редом" sen="13" pos="D" />
<word w="с" l="с" sen="13" pos="R" />
<word w="тези" l="този" sen="13" pos="PDOp" />
<word w="названия" l="название" sen="13" pos="NCNpon" />
<word w="местното" l="местен" sen="13" pos="Asnd" mwe="2:0" mwe_type="7" />
<word w="население" l="население" sen="13" pos="NCNson" mwe="2:1" mwe_type="7" />
<word w="я" l="аз" sen="13" pos="PHza3sf" />
<word w="нарича" l="наричам" sen="13" pos="VLITe2s" />
<word w="и" l="и" sen="13" pos="C" />
<word w="с" l="с" sen="13" pos="R" />
<word w="името" l="име" sen="13" pos="NCNsdn" />
<word w="“" l="“" sen="13" pos="U" />
<word w="Балзена" l="балзена" sen="13" pos="Ns" />
<word w="”" l="”" sen="13" pos="U" />
<word w="." l="." sen="13" pos="U" />

The corpus is processed with the following applications: sentence splitter, tokeniser, POS and grammatical tagger, and lemmatiser. Each MWE contains two or more words, each of which is tagged with the MWE id and the consecutive number of the respective component within the structure of the MWE. MWEs are also labelled with a type based on an idiomaticity classification (whether the MWE is a named entity, whether it refers to a named entity, and to what degree its meaning is compositional).
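As an illustration of the annotation described above, the following sketch reads `word` elements and groups MWE components. It assumes that the `mwe` attribute has the form `id:position` (so `2:0` is the first component of MWE 2) and that MWE ids are unique within a sentence. The `<text>` wrapper element and the `collect_mwes` function are hypothetical, added only so that the fragment parses as one XML document.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Three tokens from the example above. The <text> root is an assumption,
# not necessarily the element used in the distributed files.
SAMPLE = """<text>
<word w="местното" l="местен" sen="13" pos="Asnd" mwe="2:0" mwe_type="7" />
<word w="население" l="население" sen="13" pos="NCNson" mwe="2:1" mwe_type="7" />
<word w="я" l="аз" sen="13" pos="PHza3sf" />
</text>"""


def collect_mwes(xml_text):
    """Group MWE word forms by (sentence, MWE id), ordered by component number."""
    groups = defaultdict(list)
    for word in ET.fromstring(xml_text).iter("word"):
        mwe = word.get("mwe")
        if mwe is None:
            continue  # token does not belong to an MWE
        # Assumed layout of the attribute: "<MWE id>:<component number>".
        mwe_id, position = mwe.split(":")
        groups[(word.get("sen"), mwe_id)].append((int(position), word.get("w")))
    return {key: [form for _, form in sorted(parts)] for key, parts in groups.items()}


print(collect_mwes(SAMPLE))
# → {('13', '2'): ['местното', 'население']}
```

The same loop could collect `mwe_type` or the lemma (`l`) instead of the word form if the type classification is of interest.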

Classification

Domain Label # Texts # Words
Archeology A-Archeology 5 10,250
Biology B-Biology 70 134,115
Chemistry C-Chemistry 25 56,127
Physics D-Physics 23 47,786
Economics E-Economics 98 203,368
Philosophy F-Philosophy 157 342,099
Geography G-Geography 1,102 2,267,690
History H-History 505 1,048,621
Literature I-Literature 37 66,902
Medicine J-Medicine 58 117,123
Astronomy K-Astronomy 20 59,418
Linguistics L-Linguistics 18 34,649
Maths M-Maths 34 61,622
Sociology N-Sociology 14 41,878
Psychology O-Psychology 17 31,970
Education P-Education 69 125,177
Law Q-Law 17 34,341
Technology R-Тechnology 119 255,550
Politics S-Politics 459 1,038,629
Culture T-Culture 253 502,641
Architecture U-Architecture 12 31,116
Sport V-Sport 135 315,819
Military W-Military 250 497,445
Popular Y-Popular 5 7,537
Unknown Z 2,809 6,101,939
Total 6,311 13,433,812

Structure of Wiki1000+ – number of texts and words for each style directory.
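To make the proportions in the table easier to read, the sketch below parses a few rows and computes each domain's share of the total word count. The row strings and the total are copied from the table; the `parse_row` and `word_share` helpers are hypothetical and assume that the last two whitespace-separated fields of a row are the text and word counts.

```python
TOTAL_WORDS = 13_433_812  # "Total" row of the table

ROWS = [
    "Geography G-Geography 1,102 2,267,690",
    "History H-History 505 1,048,621",
    "Politics S-Politics 459 1,038,629",
]


def parse_row(line):
    """Split a table row into (domain, label, texts, words)."""
    head, texts, words = line.rsplit(None, 2)
    domain, label = head.rsplit(None, 1)
    # Counts use a comma as the thousands separator.
    return domain, label, int(texts.replace(",", "")), int(words.replace(",", ""))


def word_share(line, total=TOTAL_WORDS):
    """Return the percentage of all corpus words that fall in this row's domain."""
    words = parse_row(line)[3]
    return 100 * words / total


for row in ROWS:
    print(parse_row(row)[0], round(word_share(row), 1))
```

For these three rows the shares come to roughly 16.9%, 7.8% and 7.7% of the corpus respectively.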

Download

The corpus is distributed under the Creative Commons Attribution-NonCommercial 3.0 Unported License.

The Wiki1000+ corpus can be downloaded from here.

Copyright © 2015 Department of computational linguistics. All rights reserved.