TUTORIALS | Computational Linguistics in Bulgaria (CLIB-2024)

The tutorials will take place on 8 and 11 September 2024.

The sessions scheduled for the 8th of September will be held at Sofia University’s New Conference Hall (15 Tsar Osvoboditel Blvd.).

The sessions on the 11th of September will be held online.

PROGRAMME

8 September 2024

8:45 – 9:15 – Registration

9:15 – 11:00 – Socio Political Event Extraction – Tutor: Hristo Tanev (European Commission, Joint Research Centre)

9:15 – 10:00 – Session 1: Approaches and overview of sociopolitical (SP) event extraction

10:00 – 10:15 – Break

10:15 – 11:00 – Session 2: Event classification and event extraction resources

11:00 – 11:15 – Coffee Break

11:15 – 13:00 – The Role of Syntactic Corpora in the Era of Large Language Models – Tutor: Petya Osenova (Sofia University St. Kl. Ohridski; Institute of Information and Communication Technologies)

11:15 – 12:00 – Session 1: An Introduction to Syntactic Corpora

12:00 – 12:15 – Break

12:15 – 13:00 – Session 2: Making the Best of Two Worlds: Syntactic Corpora and Syntactic Parsers vs. Large Language Models

13:00 – 14:15 – Lunch Break

14:15 – 17:00 – Annotating Multiword Expressions – Tutors: Verginica Barbu Mititelu (Romanian Academy, Research Institute for Artificial Intelligence), Ivelina Stoyanova (Bulgarian Academy of Sciences, Institute for Bulgarian Language)

14:15 – 15:00 – Session 1: The Phenomenon of Multiword Expressions (MWEs): Definition, Examples, Characteristics. Lexical resources of MWEs – Tutor: Verginica Barbu Mititelu

15:00 – 15:15 – Break

15:15 – 16:00 – Session 2: State-of-the-art of MWE Identification in Text – Tutor: Ivelina Stoyanova

16:00 – 16:15 – Coffee Break

16:15 – 17:00 – Session 3: A Decision Tree for Verbal MWEs – Tutor: Verginica Barbu Mititelu

11 September 2024

9:00 – 12:00 – Annotating Multiword Expressions – Tutors: Verginica Barbu Mititelu (Romanian Academy, Research Institute for Artificial Intelligence), Ivelina Stoyanova (Bulgarian Academy of Sciences, Institute for Bulgarian Language)

9:15 – 10:00 – Session 4: Annotation of Verbal MWEs in a Sample Corpus. A Hands-on Activity. Part I – Tutor: Ivelina Stoyanova

10:00 – 10:15 – Break

10:15 – 11:00 – Session 5: Annotation of Verbal MWEs in a Sample Corpus. A Hands-on Activity. Part II – Tutor: Verginica Barbu Mititelu

11:00 – 11:15 – Break

11:15 – 12:00 – Session 6: Findings, Conclusions, Discussion – Tutor: Ivelina Stoyanova

TUTORIAL DESCRIPTIONS

Socio Political Event Extraction

Abstract: The purpose of this tutorial is to introduce researchers and practitioners to the fast-developing technology of event extraction. In particular, it will present the architecture of NEXUS, a state-of-the-art event extraction system that is part of the Europe Media Monitoring project. NEXUS detects and analyses socio-political events, crimes, and disasters, and works in several European languages.

We will give an overview, with real-life examples, of linguistic rules such as cascaded finite-state grammars and regular expressions, and of lexical resources for parsing event-specific information. We will also cover recent machine learning (ML) approaches to classifying sentences and documents into event classes, as well as the use of ML for extracting event arguments.

The tutorial is intended for participants who have some experience in natural language processing and little or no experience with event detection and extraction technology.

Tutor: Hristo Tanev (European Commission, Joint Research Centre, Ispra, Italy)

Short bio: Hristo Tanev is a researcher at the Text Mining Competence Centre at the Joint Research Centre. His main research is in the area of event extraction. He is a co-organiser of the Workshop on Challenges and Applications of Automated Extraction of Socio-Political Events from Text (CASE), which has been co-located with EMNLP, ACL, RANLP, LREC and other prestigious NLP conferences. He has also developed the NEXUS event extraction system, which is part of the Europe Media Monitor.

Hristo Tanev worked at the Istituto Trentino di Cultura from 2001 to 2006, where he actively researched open-domain Question Answering and worked on a postdoctoral project titled “MoreWeb – Multilingual Question Answering on the Web”.

The Role of Syntactic Corpora in the Era of Large Language Models

Abstract: This tutorial is aimed at BA, MA and PhD students in Linguistics, Humanities or Computer Science who have some acquaintance with grammar, language corpora and/or LLMs, and would like to know more about the interaction among them. Participants should be able to follow the presentations in English.

The main material will be in the form of slide presentations. In addition, appropriate webpages will be used and some demos will be shown. [...] constituency-based, dependency-based or mixed. The annotation schemas usually depend on the subsequent tasks: they may handle syntactic knowledge only or add lexical/sentence semantics, coreference, world knowledge, etc., and may use a tree-based only or a graph-based representation.

Tutor: Petya Osenova (Sofia University St. Kl. Ohridski; Bulgarian Academy of Sciences, Institute of Information and Communication Technologies)

Short bio: Petya Osenova is Professor in Contemporary Bulgarian Grammar at the Faculty of Slavic Studies at Sofia University St. Kl. Ohridski and a senior researcher at the Department of AI and Language Technologies of the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. Her research interests are in the fields of formal and computational linguistics, language resources, and the grammar-lexicon interface.

Petya Osenova was a key participant in a number of EU projects related to eLearning, machine translation and language resources (EuroMatrixPlus, AsIsKnown, QTLeap and EUCases, among others). She has also served as the person responsible for language resources in CLaDA-BG, the joint CLARIN and DARIAH framework in Bulgaria, and as the Bulgarian representative on the User Involvement Committee of CLARIN-ERIC.

She specialised in computational linguistics as a postdoctoral fellow at Tübingen University, Germany (2003) and at Groningen University, the Netherlands (2004), and was a Fulbright scholar at Stanford University, USA (2010). In 2018 Petya Osenova received the Clarivate Analytics award for excellence in scientific research in South-Eastern Europe.

Annotating Multiword Expressions

Abstract: This tutorial is meant to familiarise the audience with the principles and guidelines for annotating (mainly verbal and nominal) multiword expressions in a corpus. The notion of “multiword expression” will be introduced with a view to the challenges this phenomenon raises in automatic processing of texts and the importance of creating corpora annotated with multiword expressions.

We will first give a general picture of the research concerns surrounding multiword expressions, and then zoom in on the activities within the previous PARSEME and current UniDive COST Actions and their efforts to offer a consistent, uniform treatment of such expressions in various languages, with an eye to capturing their universality as well as accommodating the specificities of individual languages. The annotation guidelines will be explained and illustrated, and then the audience will have the opportunity to test them themselves and discuss their observations during the hands-on sessions.

Tutors: Verginica Barbu Mititelu (Romanian Academy, Research Institute for Artificial Intelligence), Ivelina Stoyanova (Bulgarian Academy of Sciences, Institute for Bulgarian Language)

Verginica Barbu Mititelu is a linguist working as a senior researcher at the Romanian Academy Research Institute for Artificial Intelligence. She completed her Master’s studies at the University of Bucharest, where she also received her PhD in Philology in 2010. She has been consistently involved in the development of language resources, especially for Romanian, applying up-to-date annotation schemes and adjusting them to the characteristics of the language under study. She has also been concerned with standardising the resources developed, especially using Linked Data principles of representation, and with registering their metadata in international data repositories. She has taken part in a number of large-scale international projects, such as BalkaNet, CLARIN, METANET4YOU and ACCURAT.

Verginica Barbu Mititelu is the language leader for Romanian in the MWE corpus annotation task of the PARSEME and UniDive COST Actions. In the latter, she also serves as a leader of the Working Group on Lexicon-Corpus Interface.

Ivelina Stoyanova works at the Department of Computational Linguistics at the Institute for Bulgarian Language, Bulgarian Academy of Sciences. She has a Master’s degree in Bulgarian Studies from Sofia University and a Bachelor’s degree in Computer Science and Mathematics from the University of Bath, UK. She obtained her PhD degree from the Institute for Bulgarian Language in 2012, with a thesis on MWE recognition and tagging in Bulgarian. She works actively on many national and international projects developing language resources and applications for language processing, and was part of initiatives such as BalkaNet, CESAR and ATLAS, among others.

She was the language leader for Bulgarian in the MWE corpus annotation task of the PARSEME COST Action and is currently an active member of the UniDive COST Action.