General description
Wiki1000+ is a corpus of Wikipedia articles compiled for the study of multiword expressions (MWEs) in Bulgarian. It contains 6,311 text samples and 13.4 million tokens. The corpus is part of the Bulgarian National Corpus.
Compilation
The corpus is collected automatically via a web crawler which crawls all pages in the Bulgarian section of Wikipedia and stores all valid documents. Simultaneously, metadata are extracted and stored in the format of the Bulgarian National Corpus.
A flat XML format is preferred for representing the annotation because it is robust and easy to process. The Wikipedia pages were retrieved in the Special:Export format, which the wiki software generates on request. The full Wikipedia corpus comprises 176,622 text files and 41 million words.
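As an illustration of the Special:Export mechanism, the sketch below builds the export address for a single page title. It is not the crawler used to compile the corpus: the function name export_url and its lang parameter are hypothetical, and the only thing relied on is the standard MediaWiki address form Special:Export/Title.

```python
from urllib.parse import quote


def export_url(title: str, lang: str = "bg") -> str:
    """Build the Special:Export address for one page (hypothetical helper)."""
    # MediaWiki titles use underscores where the displayed title has spaces.
    page = title.replace(" ", "_")
    # Percent-encode the UTF-8 bytes so Cyrillic titles are safe in a URL.
    return f"https://{lang}.wikipedia.org/wiki/Special:Export/{quote(page)}"


print(export_url("София"))
```

Requesting such an address returns the page wrapped in MediaWiki export XML, which would still need to be converted to the flat format described here.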
To reduce processing time, Wiki1000+ includes only texts with more than 1,000 tokens.
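The length filter can be sketched as follows. This is a minimal illustration, not the selection script used for the corpus: the helper names count_tokens and select_long_texts are hypothetical, and it assumes that every word element of the flat XML, punctuation included, counts as one token.

```python
import re

# Matches the start of each flat <word ... /> element.
WORD_TAG = re.compile(r"<word\b")


def count_tokens(flat_xml: str) -> int:
    """Count word elements; assumes one element per token (punctuation included)."""
    return len(WORD_TAG.findall(flat_xml))


def select_long_texts(texts: dict[str, str], threshold: int = 1000) -> dict[str, str]:
    """Keep only the texts with strictly more than `threshold` tokens."""
    return {name: xml for name, xml in texts.items() if count_tokens(xml) > threshold}


# Two synthetic documents: one exactly at the threshold, one just above it.
line = '<word w="." l="." sen="1" pos="U" />\n'
texts = {"at_threshold.xml": line * 1000, "above_threshold.xml": line * 1001}
print(list(select_long_texts(texts)))
```

A text with exactly 1,000 tokens is dropped here, following the wording "more than 1000 tokens".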
Format of the data and annotation
Example:
<word w="Редом" l="редом" sen="13" pos="D" />
<word w="с" l="с" sen="13" pos="R" />
<word w="тези" l="този" sen="13" pos="PDOp" />
<word w="названия" l="название" sen="13" pos="NCNpon" />
<word w="местното" l="местен" sen="13" pos="Asnd" mwe="2:0" mwe_type="7" />
<word w="население" l="население" sen="13" pos="NCNson" mwe="2:1" mwe_type="7" />
<word w="я" l="аз" sen="13" pos="PHza3sf" />
<word w="нарича" l="наричам" sen="13" pos="VLITe2s" />
<word w="и" l="и" sen="13" pos="C" />
<word w="с" l="с" sen="13" pos="R" />
<word w="името" l="име" sen="13" pos="NCNsdn" />
<word w="“" l="“" sen="13" pos="U" />
<word w="Балзена" l="балзена" sen="13" pos="Ns" />
<word w="”" l="”" sen="13" pos="U" />
<word w="." l="." sen="13" pos="U" />
The corpus is processed with the following tools: a sentence splitter, a tokeniser, a POS and grammatical tagger, and a lemmatiser. Each MWE contains two or more words, and each word is tagged with the MWE id and the consecutive number of that component within the structure of the MWE (mwe="2:0" and mwe="2:1" in the example above). The MWEs are also labelled with their type (mwe_type), based on an idiomaticity classification: whether the MWE is a named entity, whether it refers to a named entity, and to what degree the meaning of the MWE is compositional.
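The annotation can be read back with a few lines of Python. The sketch below is illustrative only and rests on several assumptions: the mwe attribute is read as "id:position" (as the example suggests), MWE ids are grouped per sentence because the scope of the ids is not stated, and the word elements are wrapped in a hypothetical doc root because the sample shows no enclosing element. The function name group_mwes is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """
<word w="названия" l="название" sen="13" pos="NCNpon" />
<word w="местното" l="местен" sen="13" pos="Asnd" mwe="2:0" mwe_type="7" />
<word w="население" l="население" sen="13" pos="NCNson" mwe="2:1" mwe_type="7" />
<word w="я" l="аз" sen="13" pos="PHza3sf" />
"""


def group_mwes(flat_xml: str) -> dict:
    """Map (sentence, MWE id) to the MWE type and its words in component order."""
    # Wrap the flat elements in a made-up root so the fragment is well-formed XML.
    root = ET.fromstring("<doc>" + flat_xml + "</doc>")
    parts = defaultdict(list)
    types = {}
    for word in root.iter("word"):
        mwe = word.get("mwe")
        if mwe is None:
            continue  # token is not part of any MWE
        mwe_id, position = mwe.split(":")  # assumed layout: "id:position"
        key = (word.get("sen"), mwe_id)
        parts[key].append((int(position), word.get("w")))
        types[key] = word.get("mwe_type")
    return {
        key: {"type": types[key], "words": [w for _, w in sorted(items)]}
        for key, items in parts.items()
    }


print(group_mwes(SAMPLE))
```

Sorting by the component number restores the order of the components even if the elements are processed out of sequence.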
Classification
Domain | Label | # Texts | # Words |
Archeology | A-Archeology | 5 | 10,250 |
Biology | B-Biology | 70 | 134,115 |
Chemistry | C-Chemistry | 25 | 56,127 |
Physics | D-Physics | 23 | 47,786 |
Economics | E-Economics | 98 | 203,368 |
Philosophy | F-Philosophy | 157 | 342,099 |
Geography | G-Geography | 1,102 | 2,267,690 |
History | H-History | 505 | 1,048,621 |
Literature | I-Literature | 37 | 66,902 |
Medicine | J-Medicine | 58 | 117,123 |
Astronomy | K-Astronomy | 20 | 59,418 |
Linguistics | L-Linguistics | 18 | 34,649 |
Maths | M-Maths | 34 | 61,622 |
Sociology | N-Sociology | 14 | 41,878 |
Psychology | O-Psychology | 17 | 31,970 |
Education | P-Education | 69 | 125,177 |
Law | Q-Law | 17 | 34,341 |
Technology | R-Technology | 119 | 255,550 |
Politics | S-Politics | 459 | 1,038,629 |
Culture | T-Culture | 253 | 502,641 |
Architecture | U-Architecture | 12 | 31,116 |
Sport | V-Sport | 135 | 315,819 |
Military | W-Military | 250 | 497,445 |
Popular | Y-Popular | 5 | 7,537 |
Unknown | Z | 2,809 | 6,101,939 |
Total | | 6,311 | 13,433,812 |
Structure of Wiki1000+ – number of texts and words for each style directory.
Download
The corpus is distributed under the Creative Commons Attribution-NonCommercial 3.0 Unported License.
The Wiki1000+ corpus can be downloaded from here.