Corpus-Extracted MWE Lists

We adopt the classification of multiword expressions (MWEs) developed by Baldwin et al. (Baldwin, T., C. Bannard, T. Tanaka, D. Widdows. An Empirical Model of Multiword Expression Decomposability. In: Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. 2003), who distinguish between non-decomposable, idiosyncratically decomposable and simple decomposable MWEs. We further divide simple decomposable MWEs into categories based on pragmatic factors, namely whether they are or contain a named entity (NE). Free collocations are free phrases (non-MWEs) that are statistically marked, i.e. they appear with high frequency in a corpus, but are not linguistically marked.
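The notion of a statistically marked phrase can be illustrated with a minimal sketch that counts adjacent token pairs and keeps those at or above a frequency threshold. This is not the procedure used to build the lists: the function name frequent_bigrams, the min_count threshold and the sample sentence are all hypothetical, and raw frequency alone does not separate free collocations from MWEs, since that distinction depends on linguistic marking.

```python
from collections import Counter


def frequent_bigrams(tokens, min_count=2):
    """Return adjacent token pairs that occur at least min_count times.

    Hypothetical helper for illustration only; the threshold is an
    assumption, not a value taken from the resource description.
    """
    # Pair each token with its right-hand neighbour and count the pairs.
    counts = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy input standing in for a tokenized corpus (invented sample text).
tokens = "the prime minister met the prime minister of spain".split()
candidates = frequent_bigrams(tokens, min_count=2)
```

On a real corpus the candidates would still need to be checked for linguistic marking before being assigned to any of the categories described here.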

The lists of multiword expressions are the result of automatic and semi-automatic tagging and classification of the Wiki1000+ corpus (13.4 million tokens):

  • Non-decomposable – 700,
  • Idiosyncratically decomposable – 3,156,
  • Simple decomposable
    • NEs without connection between elements – 36,932
    • NEs with meaningful element(s) – 11,248
    • Non-NEs with a vague connection between components – 1,46
    • NEs with meaningful components but connection difficult to restore – 1,086
    • NEs with descriptor and additional element – 18,962
    • Non-NEs with a NE as one of the components – 27,373
    • Non-NEs with a standard, easy to restore connection between components – 140,394
    • NEs with a standard, easy to restore connection between components – 16,653
    • Non-NEs with explicit connection between components – 1,468,
  • “Free collocations” – 49,651,
  • Free phrases – 1,197,762.