
Dr. Veselin Stoyanov

Short bio


Dr. Veselin Stoyanov is a researcher with a track record of innovation in AI and NLP to solve real-world problems. He is currently the Head of AI at Tome, where he builds practical applications of LLMs. He was previously at Facebook AI, where he led the development of industry-standard large language model methods such as RoBERTa, XLM-R, and MultiRay and their application to improving online experiences, e.g., reducing the prevalence of hate speech and bullying posts. He holds a Ph.D. from Cornell University and completed postdoctoral research at Johns Hopkins University.


Talk abstract


Large Language Models for the Real World: Explorations of Sparse, Cross-lingual Understanding and Instruction-Tuned LLM

Large language models (LLMs) have revolutionized NLP and the use of Natural Language in products. Nonetheless, there are challenges to the wide adoption of LLMs. In this talk, I will describe my explorations into addressing some of those challenges. I will cover work on sparse models addressing high computational costs, multilingual LLMs addressing the need to handle many languages, and work on instruction finetuning addressing the alignment between model outputs and human needs.

Prof. Vito Pirrelli

Short bio


Prof. Vito Pirrelli has been a research manager at the National Research Council Institute for Computational Linguistics “Antonio Zampolli” since 2003. He is head of the Laboratory for Communication Physiology and co-editor-in-chief of The Mental Lexicon and Lingue e Linguaggio. His main research interests focus on fundamental issues of language architecture and physiology, lying at the interdisciplinary crossroads of cognitive linguistics, psycholinguistics, neuroscience and information science.

Over the last 20 years, he has led a data-driven research program that uses artificial neural networks, language models and information and communication technologies to investigate language as a holistic dynamic system, emerging from interrelated patterns of sensory experience, communicative and social interaction, and psychological and neurobiological mechanisms. This research program went beyond the fragmentation of mainstream NLP technologies of the early 21st century, allowing innovation to move out of research labs and address societal needs. Using portable devices and cloud computing to collect ecological multimodal language data, the Comphys Lab currently offers a battery of tools, resources and protocols that support language teaching and education assessment, cultural integration, and the early diagnosis of and intervention in language and cognitive disorders.

In 2021, following a peer review by the relevant Class Committee, he was elected member of the Academia Europaea.


Talk abstract


Written Text Processing and the Adaptive Reading Hypothesis

Oral reading requires the fine coordination of eye movements and articulatory movements. The eye provides access to the input stimuli needed for voice articulation to unfold at a relatively constant rate, while control of articulation provides internal feedback to oculomotor control, so that eye movements can be directed when and where a decoding problem arises.

A factor that makes coordination of the eye and the voice particularly hard to manage is their asynchrony. Eye movements are faster than voice articulation and are much freer to scan a written text forwards and backwards. As a result, given a certain time window, the eye can typically fixate more words than the voice can articulate.

According to most scholars, readers compensate for this functional asynchrony by using their phonological buffer, a working memory stack of limited temporal capacity where fixated words can be maintained temporarily, until they are read out loud. The capacity of the phonological buffer thus puts an upper limit on the distance between the position of the voice and the position of the eye during oral text reading, known as the eye-voice span.
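
For concreteness, with time-aligned eye-tracking and speech data the eye-voice span at a given moment can be measured as the number of words separating the word currently being articulated from the word currently fixated. The Python sketch below assumes a simplified, hypothetical word-level alignment and ignores regressions and refixations.

def current_word(onsets, t):
    # Index of the last word whose onset is at or before time t (in seconds).
    idx = -1
    for i, onset in enumerate(onsets):
        if onset <= t:
            idx = i
    return idx

def eye_voice_span(fixation_onsets, articulation_onsets, t):
    # Words the eye has reached so far minus words the voice has reached so far.
    return current_word(fixation_onsets, t) - current_word(articulation_onsets, t)

# Toy alignment: the eye reaches each word well before the voice articulates it.
fixations = [0.0, 0.2, 0.4, 0.6, 0.8]        # first-fixation onsets per word
articulations = [0.3, 0.7, 1.1, 1.5, 1.9]    # articulation onsets per word
print(eye_voice_span(fixations, articulations, t=1.0))  # prints 3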

In my talk, I will discuss recent reading evidence showing that the eye-voice span is the “elastic” outcome of an optimally adaptive viewing strategy, interactively modulated by individual reading skills and the lexical and structural features of a text. The eye-voice span not only varies across readers depending on their rate of articulation, but also varies within each reader, getting larger when a larger structural unit is processed. This suggests that skilled readers can optimally coordinate articulation and fixation times for text processing, adaptively using their phonological memory buffer to process linguistic structures of different size and complexity.

Prof. Joakim Nivre

Short bio


Prof. Joakim Nivre is Professor of Computational Linguistics at Uppsala University and Senior Researcher at RISE (Research Institutes of Sweden). He holds a Ph.D. in General Linguistics from the University of Gothenburg and a Ph.D. in Computer Science from Växjö University.

His research focuses on data-driven methods for natural language processing, in particular for morphosyntactic and semantic analysis. He is one of the main developers of the transition-based approach to syntactic dependency parsing, described in his 2006 book Inductive Dependency Parsing and implemented in the widely used MaltParser system. He is also one of the founders of the Universal Dependencies project, which aims to develop cross-linguistically consistent treebank annotation for many languages and currently involves nearly 150 languages and over 500 researchers around the world. He has produced over 300 scientific publications and has over 42,000 citations according to Google Scholar (February 2024). He is a fellow of the Association for Computational Linguistics and was president of the association in 2017.


Talk abstract


Ten Years of Universal Dependencies

Universal Dependencies (UD) is a project developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. Since UD was launched almost ten years ago, it has grown into a large community effort involving over 500 researchers around the world, together producing treebanks for 148 languages and enabling new research directions in both NLP and linguistics. In this talk, I will review the history and development of UD and discuss challenges that we need to face when bringing UD into the future.

Jose Manuel Gomez-Perez

Short bio

Jose Manuel Gomez-Perez is the Director of Language Technology Research at expert.ai. He works at the intersection of several areas of artificial intelligence, combining structured knowledge and neural models to enable machine understanding of unstructured data as a process analogous to human comprehension. Jose Manuel collaborates with organizations like the European Space Agency and has advised several tech startups. A former Marie Curie fellow, he holds a Ph.D. in Computer Science and Artificial Intelligence based on his work during Project Halo, an initiative of Microsoft co-founder Paul Allen to create a Digital Aristotle for the life and physical sciences.

He regularly publishes in the areas of AI, natural language processing and knowledge graphs, and has given invited seminars at various universities in Europe and the USA. Recently, he published the book A Practical Guide to Hybrid Natural Language Processing. Magazines such as Nature and Scientific American, as well as newspapers such as El País, have featured his views on AI, language and vision understanding, and their applications.


Talk abstract

Towards AI that Reasons with Scientific Text and Images

Reading a textbook in a particular discipline and being able to answer the questions at the end of each chapter is one of the grand challenges of artificial intelligence, which requires advances in language, vision, problem-solving, and learning theory. Such challenges are best illustrated in the scientific domain, where complex information is presented over a variety of modalities involving not only language but also visual information, like diagrams and figures.

In this talk, we will analyze the specific challenges involved in understanding scientific documents and share some of the recent advances in the area that enable the development of AI systems capable of answering scientific questions. In addition, we will reflect on what new developments will be required to address the next grand challenge: to create an AI system that can make major scientific discoveries by itself.

Prof. Bolette Sandford Pedersen

Short bio

Bolette Sandford Pedersen is professor of computational linguistics at the University of Copenhagen, Deputy Head of the Department of Nordic Studies and Linguistics, and Centre Leader of the Centre for Language Technology. Her main research interests include computational lexicography, lexical semantics and linguistic ontologies.

Bolette Sandford Pedersen was coordinator of the Nordic NORFA network SPINN on the harmonisation of language resources in the Nordic countries, coordinator of the Danish Senseval2 participation on sense tagging, project manager of DanNet, package leader for lexical resources in DK-CLARIN (2008-2011), Danish coordinator of the EU project CLARA (Common Language Resources and their Applications, a Marie Curie Initial Training Network, 2011-2014) and of the EU project META-NORD (2011-2013), and project co-leader of the project Semantic Processing Across Domains, financed by the Danish Research Council (2013-2016).

She has been a member of selected scientific committees at ACL, COLING, the Global WordNet Conference, the Euralex Congress, LREC and OntoLex, among others.


Talk abstract

Lexical Conceptual Resources in the Era of Neural Language Models

Lexical conceptual resources such as wordnets, framenets, terminologies and ontologies have been compiled for many languages over the last decades in order to provide NLP systems with formally expressed information about the semantics of words and phrases, and about how they refer to the world. In recent years, neural language models have become a game-changer in the NLP field, based as they are solely on text from large corpora. It is time we ask ourselves: What is the role of lexical conceptual resources in the era of neural language models? The claim of my talk is that they still play a crucial role, since NLP systems based on textual distribution alone will always remain to some extent insufficient and biased. Through my own work, which has over the years taken place in close collaboration with leading lexicographers in Denmark, I will illustrate how such conceptual resources can be compiled from existing high-quality and continuously updated lexicographical resources, and how they can be further curated by examining the distributional patterns captured in word embeddings.
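
One concrete way to picture the curation step mentioned at the end of the abstract is to compare the related words recorded for an entry with its nearest neighbours in an embedding space, flagging unsupported relations for lexicographer review. The Python sketch below is only an assumed illustration of such a workflow; the vectors and the relation table are toy stand-ins, not DanNet or any other specific resource.

import numpy as np

def nearest_neighbours(word, vectors, k=2):
    # Return the k words whose vectors are most similar to `word`.
    v = vectors[word]
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in vectors.items() if w != word}
    return [w for w, _ in sorted(sims.items(), key=lambda x: -x[1])[:k]]

def review_candidates(relations, vectors, k=2):
    # Yield (word, related_word) pairs whose relation is not mirrored by
    # distributional similarity and may deserve lexicographer attention.
    for word, related_words in relations.items():
        neighbours = set(nearest_neighbours(word, vectors, k))
        for rel in related_words:
            if rel not in neighbours:
                yield word, rel

# Toy vectors: the animal words cluster together, the vehicle words likewise.
rng = np.random.default_rng(0)
animal, vehicle = rng.standard_normal(50), rng.standard_normal(50)
vectors = {w: animal + 0.1 * rng.standard_normal(50) for w in ["dog", "cat", "animal"]}
vectors.update({w: vehicle + 0.1 * rng.standard_normal(50) for w in ["car", "vehicle"]})
# A deliberately wrong relation ("dog" -> "vehicle") gets flagged; the others do not.
relations = {"dog": ["animal", "vehicle"], "car": ["vehicle"]}
for word, rel in review_candidates(relations, vectors):
    print(f"check: {word} -- {rel}")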

Dr. Hristo Tanev (Joint Research Centre, EC, Italy)

Short bio

Hristo Tanev is a project officer and researcher at the Joint Research Centre of the European Commission. His research spans various areas of computational linguistics and natural language processing, including event extraction, text classification, question answering, social media mining, lexical learning, language resources, and multilingualism.

He is a co-organizer of the Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text. He has carried out research at three institutions: the University of Plovdiv Paisii Hilendarski (Bulgaria), ITC-irst (now Fondazione Bruno Kessler) in Trento, Italy, and the Joint Research Centre of the European Commission in Ispra, Italy. He is among the founders of SIG SLAV (the ACL Special Interest Group on Slavic Language Processing).



Demo abstract

Ontopopulis, a System for Learning Semantic Classes

Ontopopulis is a multilingual terminology learning system which implements several weakly supervised algorithms for terminology learning. Its main algorithm takes as input a set of seed terms for a semantic category under consideration and an unannotated text corpus, and learns additional terms which belong to this category. For example, for the category “environmental disasters” in Bulgarian, the input seed set is: замърсяване на водите (water pollution), изменение на климата (climate change), суша (drought). The highest-ranked new terms which the system learns for this semantic class are: опустиняване (desertification), обезлесяване (deforestation), озонова дупка (ozone hole), and so on.
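
The abstract leaves the learning procedure unspecified, but the general shape of weakly supervised term expansion can be sketched as follows: extract candidate terms from the unannotated corpus and rank them by their distributional similarity to the seed set. The Python sketch below illustrates that idea only and is not the actual Ontopopulis algorithm; the embedding function, the candidate list and the scoring are placeholder assumptions.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def expand_category(seed_terms, candidate_terms, embed, top_k=3):
    # Rank corpus candidates by cosine similarity to the centroid of the seeds.
    centroid = np.mean([embed(t) for t in seed_terms], axis=0)
    scored = [(t, cosine(embed(t), centroid))
              for t in candidate_terms if t not in seed_terms]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

def toy_embed(term, dim=50):
    # Placeholder embedding: in practice one would use distributional vectors
    # trained on the unannotated corpus (e.g. word2vec or fastText).
    rng = np.random.default_rng(abs(hash(term)) % (2**32))
    return rng.standard_normal(dim)

seeds = ["water pollution", "climate change", "drought"]
candidates = ["desertification", "deforestation", "ozone hole", "piano", "election"]
for term, score in expand_category(seeds, candidates, toy_embed):
    print(f"{term}\t{score:.3f}")

With real corpus-trained embeddings, terms such as "desertification" would rank far above unrelated candidates; with the random placeholder vectors above, the ranking is arbitrary and only the mechanics are shown.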

In the demo session we are going to show how the system learns different semantic classes in Bulgarian and English.


Prof. Iryna Gurevych (Technical University of Darmstadt, Germany)

Short bio

Iryna Gurevych is a German computer scientist. She is Professor at the Department of Computer Science of the Technical University of Darmstadt and Director of the Ubiquitous Knowledge Processing (UKP) Lab. She has a strong background in information extraction, semantic text processing, machine learning and innovative applications of NLP to the social sciences and humanities.

Iryna Gurevych has published over 300 papers in international conferences and journals and is a member of the programme and conference committees of more than 50 high-level conferences and workshops (ACL, EACL, NAACL, etc.). She holds several awards, including the Lichtenberg Professorship Career Award and the Emmy Noether Career Award (both in 2007). In 2021 she received the first LOEWE professorship of the LOEWE programme. She was selected as an ACL Fellow in 2020 for her outstanding work in natural language processing and machine learning, and has been Vice-President-Elect of the ACL since 2021.

Talk Abstract

Detect – Verify – Communicate: Combating Misinformation with More Realistic NLP

Dealing with misinformation is a grand challenge of the information society, aimed at equipping computer users with effective tools for identifying and debunking misinformation. Current Natural Language Processing (NLP), including its fact-checking research, fails to meet the expectations of real-life scenarios. In this talk, we show why past work on fact-checking has not yet led to truly useful tools for managing misinformation, and discuss our ongoing work on more realistic solutions. NLP systems are expensive in terms of the financial cost, computation, and manpower needed to create data for the learning process. With that in mind, we are pursuing research on the detection of emerging misinformation topics to focus human attention on the most harmful, novel examples. Automatic methods for claim verification rely on large, high-quality datasets. To this end, we have constructed two corpora for fact checking, considering larger evidence documents and pushing the state of the art closer to the reality of combating misinformation. We further compare the capabilities of automatic, NLP-based approaches to what human fact checkers actually do, uncovering critical research directions for the future. To counter false beliefs, we are collaborating with cognitive scientists and psychologists to automatically detect and respond to attitudes of vaccine hesitancy, encouraging anti-vaxxers to change their minds with effective communication strategies.

Prof. Shuly Wintner (University of Haifa, Israel)

Short bio

Shuly Wintner is professor of computer science at the University of Haifa, Israel. His research spans various areas of computational linguistics and natural language processing, including formal grammars, morphology, syntax, language resources, translation, and multilingualism.

He served as the editor-in-chief of Springer’s Research on Language and Computation, a program co-chair of EACL-2006, and the general chair of EACL-2014. He was among the founders, and twice (for a total of six years) the chair, of ACL SIG Semitic. He is currently the Chair of the EACL.



Talk abstract

The Hebrew Essay Corpus

The Hebrew Essay Corpus is an annotated corpus of argumentative essays in Hebrew authored by prospective higher-education students. It includes both essays by native speakers, written as part of the psychometric exam used to assess their future success in academic studies, and essays by non-native speakers with three different native languages, written as part of a language aptitude test. The corpus is uniformly encoded and stored. The non-native essays were annotated with target hypotheses whose main goal is to make the texts amenable to automatic processing (morphological and syntactic analysis).

I will describe the corpus and the error correction and annotation schemes used in its analysis. In addition, I will discuss some of the challenges involved in identifying and analyzing non-native language use in general, and propose various ways of dealing with these challenges. Then, I will present classifiers that can accurately distinguish between native and non-native authors, determine the mother tongue of the non-natives, and predict the proficiency level of non-native Hebrew learners. This is important for practical (mainly educational) applications, but the endeavor also sheds light on the features that support the classification, thereby improving our understanding of learner language in general, and of transfer effects from Arabic, French, and Russian on non-native Hebrew in particular.
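
As a rough, hypothetical sketch of the kind of classifier described above (not the author's actual system), native-language identification is commonly approached with character n-gram features and a linear model; the snippet below wires such a baseline together with scikit-learn on placeholder data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: non-native essays paired with the author's L1.
essays = ["... essay text one ...", "... essay text two ...", "... essay text three ..."]
labels = ["Arabic", "French", "Russian"]

# Character n-grams capture sub-word spelling and morphology patterns that often
# carry transfer effects; a linear classifier keeps the features inspectable.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(essays, labels)
print(clf.predict(["... an unseen essay ..."]))

The same pipeline shape can in principle be reused for the native vs. non-native and proficiency-level tasks by changing the labels, and inspecting the highest-weighted n-grams per class is one way to examine the features that support the classification.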

Linguistic Intelligence: Computers vs. Humans (Abstract)

Prof. Dr. Ruslan Mitkov, University of Wolverhampton

Computers are ubiquitous – they are, and are used, everywhere. But how good are computers at understanding and producing natural languages (e.g. English or Bulgarian)? In other words, what is the level of their linguistic intelligence? This presentation will examine the linguistic intelligence of computers and will look at the challenges ahead…
