The classification of multiword expressions (MWEs) follows Baldwin et al. (Baldwin, T., C. Bannard, T. Tanaka, D. Widdows. An Empirical Model of Multiword Expression Decomposability. In: Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. 2003), who distinguish between non-decomposable, idiosyncratically decomposable and simple decomposable MWEs. We further divide simple decomposable MWEs into categories based on pragmatic factors: whether they are, or contain, a named entity (NE). Free collocations are free phrases (non-MWEs) that are statistically marked, i.e. they appear with high frequency in a corpus, but are not linguistically marked.
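As a rough illustration of what "statistically marked" can mean in practice, the sketch below collects adjacent word pairs that recur in a token sequence. The description above does not say which statistic was actually used, so the raw-count threshold, the function name `frequent_bigrams` and the parameter `min_count` are assumptions made only for this example. It finds frequent candidates and says nothing about whether a phrase is linguistically marked.

```python
from collections import Counter


def frequent_bigrams(tokens, min_count=2):
    """Return adjacent word pairs occurring at least `min_count` times.

    Hypothetical helper: a plain frequency cut-off stands in for whatever
    statistical criterion was applied to the real corpus.
    """
    # Pair every token with its right-hand neighbour and count the pairs.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Toy input, not taken from Wiki1000+.
tokens = "the red car passed the red car near the old house".split()
candidates = frequent_bigrams(tokens)
print(candidates)
```

On a real corpus the threshold would have to be tuned, and an association measure could replace the raw count; both choices are outside what the description states.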
The lists of multiword expressions are the result of automatic and semi-automatic tagging and classification of the Wiki1000+ corpus (13.4 million tokens):
- Non-decomposable – 700
- Idiosyncratically decomposable – 3,156
- Simple decomposable:
  - NEs without connection between elements – 36,932
  - NEs with a meaningful element(s) – 11,248
  - Non-NEs with a vague connection between components – 1,46
  - NEs with meaningful components but connection difficult to restore – 1,086
  - NEs with descriptor and additional element – 18,962
  - Non-NEs with a NE as one of the components – 27,373
  - Non-NEs with a standard, easy to restore connection between components – 140,394
  - NEs with a standard, easy to restore connection between components – 16,653
  - Non-NEs with explicit connection between components – 1,468
- “Free collocations” – 49,651
- Free phrases – 1,197,762
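For readers who want to work with these figures programmatically, the following is a minimal sketch that stores the simple decomposable subcategory counts and splits them into NE and non-NE totals. The dictionary layout and the helper names `known_total` and `split_by_ne` are assumptions for illustration. The count for "Non-NEs with a vague connection between components" appears as "1,46" in the list and is not guessed here; it is stored as `None` and left out of the sums.

```python
# Counts copied from the list above. None marks the one value that is
# printed as "1,46" in the list and is therefore excluded from all totals.
SIMPLE_DECOMPOSABLE = {
    "NEs without connection between elements": 36932,
    "NEs with a meaningful element(s)": 11248,
    "Non-NEs with a vague connection between components": None,
    "NEs with meaningful components but connection difficult to restore": 1086,
    "NEs with descriptor and additional element": 18962,
    "Non-NEs with a NE as one of the components": 27373,
    "Non-NEs with a standard, easy to restore connection between components": 140394,
    "NEs with a standard, easy to restore connection between components": 16653,
    "Non-NEs with explicit connection between components": 1468,
}


def known_total(counts):
    """Sum every count that has a known value."""
    return sum(n for n in counts.values() if n is not None)


def split_by_ne(counts):
    """Return (NE total, non-NE total) over the known counts."""
    ne = sum(n for label, n in counts.items()
             if n is not None and label.startswith("NEs"))
    non_ne = sum(n for label, n in counts.items()
                 if n is not None and label.startswith("Non-NEs"))
    return ne, non_ne


ne_total, non_ne_total = split_by_ne(SIMPLE_DECOMPOSABLE)
print(known_total(SIMPLE_DECOMPOSABLE), ne_total, non_ne_total)
```

Because one subcategory is excluded, the totals are lower bounds for the simple decomposable class and for its non-NE part.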