MultiClinAI Datasets

Three different corpora will be used for the task: SpaCCC, CardioCCC and OnaCCC. Each corpus was annotated in a different context, but with similar methodologies, with the same four labels: diseases, symptoms, procedures and medications. Medication annotations will be released together with the other entities, but they will not be evaluated as part of MultiClinAI.

Example from the SpaCCC corpora in Spanish and English annotated with procedures.

SpaCCC Corpora

SpaCCC (Spanish Clinical Case Corpus) is a collection of 1,000 clinical case reports from multiple clinical specialties. The Spanish version of the SpaCCC corpus has previously been used and validated by the community as part of previous shared tasks: diseases were used as part of DisTEMIST at CLEF 2022, symptoms were part of SympTEMIST at BioCreative VIII, procedures were part of MedProcNER at CLEF 2023 and medications were part of MultiCardioNER at CLEF 2024.

To extend its potential beyond the original Spanish texts, a multilingual version of the dataset was created using machine translation and annotation projection techniques. First, the texts in the dataset were translated into six selected target languages (Czech, English, Dutch, Italian, Romanian and Swedish). Then, the entities annotated in the Gold Standard Spanish version were also translated independently. The translations for both the text and the annotations were created using DeepL, which was chosen for its apparent quality and the availability of translation models between Spanish and all target languages. 

A lexical look-up system was then used to locate the relevant translated entities in each text in each target language. The resulting dataset was then validated by bilingual clinical experts. Using the Brat annotation tool’s side-by-side comparison view, they compared the original Gold Standard in Spanish and the projected version, correcting the annotations so that they resembled the Gold Standard as much as possible. Annotation projection guidelines were also created for this task to account for possible differences between languages and difficult cases.

CardioCCC Dataset

CardioCCC (Cardiology Clinical Case Corpus) is a collection of 508 clinical case reports from the cardiology domain. They were annotated in Spanish by the same annotators as SpaCCC with diseases, symptoms, procedures and medications. The same guidelines were also used, with small modifications to adjust them for cardiology-specific information. Part of this dataset was released for the MultiCardioNER shared task at CLEF 2024. More specifically, the diseases annotations in Spanish and the medication annotations in Spanish, English and Italian were used. The multilingual version of the corpus was created using the same machine translation engine and annotation projection methodology, including the manual validation step by bilingual clinicians.

OnaCCC Datasets

OnaCCC (Original Native Clinical Case Corpus) is a collection of smaller sub-corpora of clinical case reports from open-access journals in each of the target languages (Czech, English, Dutch, Italian, Romanian and Swedish). Since most of the data in languages other than Spanish used for the task is machine translated, OnaCCC serves as an additional dataset in every target language consisting of texts natively written in each language. These datasets were also annotated using the same guidelines as the other two corpora.

Given the time and budget-related constraints of training annotators in each language, an alternative annotation methodology was devised for OnaCCC. The texts in each language were translated into Spanish, where they were annotated by the original SpaCCC and CardioCCC annotators. Following the annotation projection methodology, their annotations were then translated into the original language of each text and projected into the original texts. As a final step, the annotations were then validated by bilingual clinicals using once again Brat’s side-by-side comparison view. The final size and number of annotations is different for each OnaCCC sub-corpus, with each of them being available in Spanish as well. 


Related resources

At the NLP for Biomedical Information Analysis group (formerly Text Mining Unit), one of our missions is the open publication of datasets to train and benchmark biomedical information extraction, normalization and indexing systems. For that reason, we have released multiple datasets as part of shared tasks over the years. If you are interested in MultiClinAI, you might want to take a look at some of our resources and competitions about: