Motivation

Named entity recognition (NER) systems play a key role in clinical natural language processing (NLP) applications by identifying essential clinical variables or concept types (such as diseases and comorbidities, medications, signs and symptoms, or clinical procedures) in medical documents and electronic health records (EHRs). Such systems have proven valuable in optimizing clinical workflows, enhancing decision support, and supporting large-scale health data analysis. However, the development and evaluation of robust clinical NER systems depend heavily on the availability of well-annotated corpora. These corpora must be carefully curated by clinical experts to ensure accuracy, a process that is both time-consuming and expensive. Moreover, annotated corpora are often language-specific, creating significant challenges in multilingual contexts, particularly for less widely spoken languages.

Certain settings, such as multicenter clinical trials, large multinational clinical studies, rare disease characterization, comparative analyses between clinical sites, and international medical and clinical research collaborations, require comparable annotation criteria and information extraction systems for clinical content written in different languages. This goal is difficult to achieve because multilingual clinical corpora annotated under consistent, well-defined annotation guidelines are scarce.

Recent advances in machine translation (MT), large language models (LLMs), and generative AI offer promising opportunities for creating annotated corpora and applying NLP technologies across multiple languages. MT systems or LLMs can translate annotated corpora from one language to another while preserving the integrity and contextual meaning of the original annotations. Annotation projection, entity alignment strategies, and LLM-based data annotation can foster the development of comparable multilingual clinical corpora, which in turn can serve as training resources for comparable multilingual clinical entity recognition systems.

In this context, the MultiClinAI (Multilingual Clinical Entity Annotation Projection and Extraction) shared task aims to promote the automatic generation of comparable multilingual clinical corpora. MultiClinAI focuses on three key clinical entity types: diseases, symptoms, and procedures. These entity types are highly relevant for biomedical data analysis, predictive modeling applications, and the development and evaluation of multilingual clinical NER solutions.

The track will cover seven languages and will rely on expert-annotated and validated gold-standard corpora. It will also emphasize the implementation and evaluation of multilingual clinical entity recognition systems, comparing their performance across high-, medium-, and low-resource languages, across languages from different language families, and across clinically relevant concept types. This setup will provide a robust benchmarking scenario for multilingual clinical NLP approaches.