Artificial Intelligence Data Kit 2030

Title: Artificial Intelligence Data Kit 2030 (AID 2030)

Duration: March–April 2023

Funding: The Artificial Intelligence Data Kit 2030 (AID 2030) project was supported by the European Language Equality (ELE) project.


Principal Investigator: prof. Svetla Koeva

Team members: prof. Svetla Koeva, assoc. prof. Emil Doychev, assist. prof. Valentina Stefanova, assist. prof. Georgi Cholakov.


Based on an in-depth analysis of existing studies, we will propose an AI data kit for language understanding, generation, and transformation, together with a set of criteria for adapting the data kit to technological advances and to the level of technology support available for each language.

The objective is to specify the data kit (corpora, models, datasets, etc.) required to develop computer applications classified as artificial intelligence (AI).


Flexibility: the data types and their characteristics can change with the advance of technologies, as well as in relation to different languages.

Scalability: separate data kits for languages with excellent, good, moderate, fragmentary, weak, or no language technology support.

Focus: targeting novel applications for language analysis, generation, and transformation based on Natural Language Understanding and Generation and geared toward General AI.

Free licenses: allowing the sharing of LT resources, services, datasets, models, and code between all stakeholders.

Standardization: ensuring data and metadata interoperability and, among other things, promoting international standardization of European approaches to LT and AI.


Research advances: fostering data sharing, focusing on trustworthy, standardized, and interoperable data; enabling better modeling of multimodal and multilingual environments; and showing how modalities can enrich one another.

Business driving: channeling data into AI development, enabled by concurrent increases in high-quality data, computing capability, and high-speed communication links.

Policymaking: changing legal frameworks so that all unprotected language-related data becomes available; demonstrating the need for investment by showing that diverse data collections are necessary for a variety of applications and for particular languages.

Copyright © 2015-2022 Department of computational linguistics. All rights reserved.