Wiki1000+ corpus with annotated MWEs « Department of Computational Linguistics

General description

Wiki1000+ is a corpus of Wikipedia articles compiled for the study of multiword expressions (MWEs) in Bulgarian. It contains 6,311 text samples and 13.4 million tokens. The corpus is part of the Bulgarian National Corpus.

Compilation

The corpus is collected automatically by a web crawler that visits all pages in the Bulgarian section of Wikipedia and stores all valid documents. At the same time, metadata are extracted and stored in the format of the Bulgarian National Corpus.

A flat XML format is preferred for representing the annotation because it is robust and easy to process. The Wikipedia pages were obtained in the format produced by Special:Export, which the wiki software generates on request. The full Wikipedia corpus comprises 176,622 text files and 41 million words.
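As an illustration of the Special:Export step, the sketch below builds an export URL for a page title and reads page text out of an export document. It is not the crawler used for the corpus: the function names, the `host` parameter, the URL pattern and the embedded sample document are assumptions made for the example.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET


def export_url(title, host="bg.wikipedia.org"):
    """Build a Special:Export URL for one page title (URL pattern assumed)."""
    return f"https://{host}/wiki/Special:Export/{quote(title.replace(' ', '_'))}"


def page_texts(export_xml):
    """Return (title, wikitext) pairs found in a Special:Export XML string."""
    root = ET.fromstring(export_xml)
    pages = []
    # "{*}" matches any namespace (Python 3.8+), since the schema version varies.
    for page in root.findall("{*}page"):
        title = page.findtext("{*}title", default="")
        text = page.findtext("{*}revision/{*}text", default="")
        pages.append((title, text))
    return pages


# A tiny hand-written stand-in for an export document (not real corpus data).
SAMPLE_EXPORT = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Example</title>
    <revision><text>Sample wikitext.</text></revision>
  </page>
</mediawiki>"""

print(export_url("Stara Zagora"))
print(page_texts(SAMPLE_EXPORT))
```

In a real crawl the XML string would come from an HTTP request to the generated URL; that step is left out here so the sketch runs offline.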

Wiki1000+ includes only texts with more than 1,000 tokens, which reduces processing time.
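The length criterion can be sketched as a simple filter. The function name and the input shape (a mapping from a document id to its list of tokens) are hypothetical. How tokens were counted for the corpus is not specified here, so the sketch assumes tokenisation has already been done.

```python
def select_texts(docs, min_tokens=1000):
    """Keep only the documents that have more than `min_tokens` tokens."""
    return {doc_id: tokens for doc_id, tokens in docs.items()
            if len(tokens) > min_tokens}


# Hypothetical toy input: one text at the threshold, one just above it.
docs = {"short": ["w"] * 1000, "long": ["w"] * 1001}
print(list(select_texts(docs)))
```

The comparison is strict, following the wording "more than 1000 tokens".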

Format of the data and annotation

Example:
<word w="Редом" l="редом" sen="13" pos="D" />
<word w="с" l="с" sen="13" pos="R" />
<word w="тези" l="този" sen="13" pos="PDOp" />
<word w="названия" l="название" sen="13" pos="NCNpon" />
<word w="местното" l="местен" sen="13" pos="Asnd" mwe="2:0" mwe_type="7" />
<word w="население" l="население" sen="13" pos="NCNson" mwe="2:1" mwe_type="7" />
<word w="я" l="аз" sen="13" pos="PHza3sf" />
<word w="нарича" l="наричам" sen="13" pos="VLITe2s" />
<word w="и" l="и" sen="13" pos="C" />
<word w="с" l="с" sen="13" pos="R" />
<word w="името" l="име" sen="13" pos="NCNsdn" />
<word w="“" l="“" sen="13" pos="U" />
<word w="Балзена" l="балзена" sen="13" pos="Ns" />
<word w="”" l="”" sen="13" pos="U" />
<word w="." l="." sen="13" pos="U" />

The corpus is processed with the following applications: a sentence splitter, a tokeniser, a POS and grammatical tagger, and a lemmatiser. Each MWE contains two or more words, each of which is tagged with the MWE id and the consecutive number of the respective component within the structure of the MWE. The MWEs are also labelled with their type, based on an idiomaticity classification (whether the MWE is a named entity; whether it refers to a named entity; to what degree the meaning of the MWE is compositional).
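The annotation described above could be read back with a few lines of Python. This is a minimal sketch and not the project's own tooling. It assumes that the `mwe` attribute has the form `id:component`, that an MWE does not cross a sentence boundary (so components are grouped by sentence number and MWE id), and that the `<word>` elements need a temporary root element before they can be parsed. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Three elements taken from the example above.
SAMPLE = (
    '<word w="местното" l="местен" sen="13" pos="Asnd" mwe="2:0" mwe_type="7" /> '
    '<word w="население" l="население" sen="13" pos="NCNson" mwe="2:1" mwe_type="7" /> '
    '<word w="я" l="аз" sen="13" pos="PHza3sf" />'
)


def read_words(flat_xml):
    """Parse a run of <word/> elements into a list of attribute dicts."""
    # The fragment has no single root element, so wrap it in a temporary one.
    root = ET.fromstring(f"<doc>{flat_xml}</doc>")
    return [dict(word.attrib) for word in root.iter("word")]


def collect_mwes(words):
    """Group MWE components by (sentence, MWE id), ordered by component number."""
    groups = defaultdict(list)
    for word in words:
        if "mwe" not in word:
            continue  # the token is not part of an MWE
        mwe_id, position = word["mwe"].split(":")
        groups[(word["sen"], mwe_id)].append((int(position), word))
    result = {}
    for key, parts in groups.items():
        parts.sort(key=lambda part: part[0])
        result[key] = {
            "type": parts[0][1].get("mwe_type"),
            "forms": [w["w"] for _, w in parts],
            "lemmas": [w["l"] for _, w in parts],
        }
    return result


mwes = collect_mwes(read_words(SAMPLE))
print(mwes)
```

For the sample, the two components tagged `2:0` and `2:1` are collected into one entry with type `7`, and the token without an `mwe` attribute is skipped.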

Classification

Domain	Label	# Texts	# Words
Archeology	A-Archeology	5	10,250
Biology	B-Biology	70	134,115
Chemistry	C-Chemistry	25	56,127
Physics	D-Physics	23	47,786
Economics	E-Economics	98	203,368
Philosophy	F-Philosophy	157	342,099
Geography	G-Geography	1,102	2,267,690
History	H-History	505	1,048,621
Literature	I-Literature	37	66,902
Medicine	J-Medicine	58	117,123
Astronomy	K-Astronomy	20	59,418
Linguistics	L-Linguistics	18	34,649
Maths	M-Maths	34	61,622
Sociology	N-Sociology	14	41,878
Psychology	O-Psychology	17	31,970
Education	P-Education	69	125,177
Law	Q-Law	17	34,341
Technology	R-Technology	119	255,550
Politics	S-Politics	459	1,038,629
Culture	T-Culture	253	502,641
Architecture	U-Architecture	12	31,116
Sport	V-Sport	135	315,819
Military	W-Military	250	497,445
Popular	Y-Popular	5	7,537
Unknown	Z	2,809	6,101,939
Total		6,311	13,433,812