Corpus Description

About the texts

MultiCardioNER focuses on adapting general clinical models to specific specialties and on creating multilingual models, so it uses multiple datasets:

The DisTEMIST and DrugTEMIST (newly released for this task) corpora are a collection of 1,000 clinical cases in Spanish from different medical specialties (including oncology, otorhinolaryngology, dentistry, pediatrics, primary care, allergology, radiology, psychiatry, ophthalmology and more). They are annotated with disease and medication mentions, respectively. Both use the same text documents, which belong to the SPACCC corpus and are also the ones used in MedProcNER/ProcTEMIST and SympTEMIST, making all four datasets complementary for medical entity recognition. The DrugTEMIST corpus is also released in English and Italian.

A collection of cardiology clinical case reports (CardioCCC) is used for the domain adaptation part of the task. Clinical case reports are a medical text genre that describes a patient’s medical history, symptoms, diagnosis, and treatment in detail. The dataset contains 508 documents, split into 258 for development and 250 for testing. It has been annotated with diseases and medications using the same guidelines as the DisTEMIST and DrugTEMIST corpora. The medications part is released in three languages: Spanish, English and Italian. Although the 258-document split was originally devised as a development set, participants are allowed to mix and match their data as they see fit to create different experiments.

About the annotations

All datasets were manually annotated by clinical experts using the brat annotation tool. The experts followed annotation guidelines that were refined over several cycles of quality control and annotation consistency analysis before the entire dataset was annotated.

The annotations were originally created in Spanish and then transferred into English and Italian via machine translation and lexical annotation projection. The result of this process was reviewed and validated by clinicians who are also native speakers of each language. To account for possible mistranslations that affected the clinical entities, these experts provided alternative suggestions for annotated entities whenever they did not agree with the automatic translation. These suggestions were then integrated into the text by replacing the existing annotated span with the proposed text.
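The span replacement step described above can be illustrated with a minimal sketch. This is not the organizers' actual pipeline: the function name `replace_span`, the `(start, end, label)` tuple layout, the `DRUG` label and the sample sentence are all hypothetical, and the sketch assumes that annotations do not overlap.

```python
def replace_span(text, annotations, index, new_text):
    """Replace one annotated span with a reviewer's suggestion.

    `annotations` is a list of (start, end, label) tuples with character
    offsets into `text`. Assumption: annotations do not overlap.
    Returns the updated text and the annotations with shifted offsets.
    """
    start, end, _ = annotations[index]
    # How much the text grows (positive) or shrinks (negative).
    delta = len(new_text) - (end - start)
    updated_text = text[:start] + new_text + text[end:]

    updated = []
    for i, (s, e, label) in enumerate(annotations):
        if i == index:
            # The replaced span keeps its start and gets a new end.
            updated.append((start, start + len(new_text), label))
        elif s >= end:
            # Spans after the replacement move by delta characters.
            updated.append((s + delta, e + delta, label))
        else:
            # Spans before the replacement are unaffected.
            updated.append((s, e, label))
    return updated_text, updated


# Hypothetical example: a misspelled drug name is corrected.
text = "Patient took asprin and ibuprofen."
annotations = [(13, 19, "DRUG"), (24, 33, "DRUG")]
new_text, new_annotations = replace_span(text, annotations, 0, "aspirin")
```

The point of the sketch is that replacing a span changes the length of the text, so the offsets of every later annotation have to be shifted to stay aligned.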

More information on the annotation scheme (and its inter-annotator agreement) is available in the Annotation Guidelines page.

About the format

MultiCardioNER is made available in two formats: .ann (used by brat) and .tsv. For more information on brat’s format please visit: https://brat.nlplab.org/standoff.html. The .tsv file columns are explained in the dataset’s attached README file.
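As a rough illustration of the .ann format, the sketch below reads the text-bound annotation lines ("T" lines) of a brat standoff file, which have the form ID, a tab, the entity type and character offsets, a tab, and the covered text. The names `Entity` and `parse_ann` are hypothetical, and the labels and offsets in the usage example are placeholders rather than values taken from the corpus. The .tsv layout is not covered here because its columns are defined in the README.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str                   # e.g. "T1"
    label: str                    # entity type as written in the file
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str                     # text covered by the annotation


def parse_ann(content: str) -> List[Entity]:
    """Parse text-bound annotations from brat standoff content.

    Lines that are not text-bound annotations (notes, relations,
    attributes and so on) are skipped.
    """
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate fragments with ";".
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Placeholder content; labels and offsets are made up for illustration.
sample = (
    "T1\tDISEASE 10 23\tinfarto agudo\n"
    "#1\tAnnotatorNotes T1\tan annotator note\n"
    "T2\tDRUG 30 38;45 51\taspirina 100 mg\n"
)
entities = parse_ann(sample)
```

To read a real file, the content could be loaded with `open(path, encoding="utf-8").read()` and passed to `parse_ann`; the UTF-8 encoding is an assumption that should be checked against the dataset's README.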