MultiClinNER Data

The corpora for all target languages are available online on Zenodo: MultiClinNER

The MultiClinNER subtask focuses on multilingual, comparable clinical entity recognition. The training data consists of manually validated annotations of three clinical entity types across seven languages: Spanish (es), English (en), Dutch (nl), Italian (it), Romanian (ro), Swedish (sv), and Czech (cz).

Data Sources

The dataset is built from several well-established clinical corpora used in previous clinical shared tasks:

SpaCCC (Spanish Clinical Case Corpus) consists of 1,000 clinical case reports covering multiple medical specialties.
CardioCCC (Cardiology Clinical Case Corpus) comprises 508 clinical case reports from the field of cardiology.
OnaCCC (Original Native Clinical Case Corpus) is a collection of smaller sub-corpora of clinical case reports from open-access journals in each of the target languages (Czech, English, Dutch, Italian, Romanian and Swedish). The number of documents varies by language due to differences in the availability of open-access clinical case reports. Overall, the size of the language-specific sub-corpora ranges from 100 to 1,132 clinical case reports, with 100 documents in Czech, 1,131 in Dutch, 1,132 in English, 317 in Italian, 113 in Romanian, and 100 in Swedish.

These corpora were originally annotated following the same annotation guidelines, which cover four clinical entity types:

DISEASE
SYMPTOM
PROCEDURE
MEDICATION

For the MultiClinAI shared task, only the first three entity types (DISEASE, SYMPTOM, and PROCEDURE) are evaluated; MEDICATION annotations are not.

Multilingual Data Generation

The comparable multilingual versions of the corpora were created through a combination of machine translation, annotation projection, and expert validation.

First, the original manually annotated Spanish texts were translated into the six target languages using machine translation. The annotated entity mentions from the Spanish Gold Standard were then translated independently of the texts. A lexical lookup system was built and used to locate the translated entity mentions within the translated texts.

Finally, bilingual clinical experts validated the projected annotations. Using the side-by-side comparison view in the Brat annotation tool, the projected annotations were corrected so that they matched the original Gold Standard annotations as closely as possible.
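
The lexical lookup step described above can be illustrated with a minimal sketch. The actual system used by the organizers is not described here, so this example simply assumes a case-insensitive, whole-word exact match; the function name project_mentions and the sample sentence are hypothetical.

```python
import re
from typing import List, Tuple


def project_mentions(
    text: str, mentions: List[Tuple[str, str]]
) -> List[Tuple[int, int, str, str]]:
    """Locate translated mentions in a translated text.

    `mentions` holds (label, translated_mention) pairs. Matching is a
    case-insensitive exact match on word boundaries, which is an
    assumption made for illustration, not the organizers' method.
    """
    found = []
    for label, mention in mentions:
        # Lookarounds keep the match from starting or ending inside a word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(mention) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group(0)))
    # Sort by start offset so the output follows reading order.
    return sorted(found)


if __name__ == "__main__":
    sample = "Acute myocardial infarction treated with angioplasty."
    pairs = [("PROCEDURE", "angioplasty"), ("DISEASE", "myocardial infarction")]
    for start, end, label, surface in project_mentions(sample, pairs):
        print(label, start, end, surface)
```

A purely lexical match of this kind misses inflected or reordered mentions, which is one reason the projected annotations needed expert validation.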

To complement the translated data, the OnaCCC corpus provides additional clinical case reports that were originally written in each target language. These texts were translated into Spanish and annotated following the same methodology to ensure consistency across the dataset.

Dataset Characteristics

The MultiClinNER training data therefore includes:

Clinical case reports from multiple medical domains
Texts available in seven languages
Expert-validated annotations
Three clinical entity types: DISEASE, SYMPTOM, and PROCEDURE

The dataset contains:

The original Spanish clinical case reports used as source texts
Translated texts derived from Spanish source documents
Native texts written directly in the target language

This design allows the evaluation of multilingual, cross-lingual, and language-specific NER systems.

Data Format

The dataset follows the BRAT standoff annotation format, where:

.txt files contain the clinical case reports
.ann files contain the entity annotations

Each annotation includes the entity type and the exact character offsets of the mention within the text.

Example annotation format:

T1 DISEASE 125 146 myocardial infarction
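
As a rough sketch of how such a line can be read, the snippet below parses one text-bound annotation into its parts. It assumes the usual brat standoff layout, in which the ID, the type with offsets, and the mention text are separated by tab characters; the Entity class and the parse_ann_line function are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann_line(line: str) -> Optional[Entity]:
    """Parse one text-bound ('T') annotation line; return None otherwise."""
    line = line.rstrip("\n")
    if not line.startswith("T"):
        return None
    # Assumed layout: "<ID>\t<TYPE> <START> <END>\t<MENTION>"
    ann_id, type_and_span, text = line.split("\t", 2)
    label, span = type_and_span.split(" ", 1)
    # Discontinuous mentions are written "s1 e1;s2 e2"; keep the outer bounds.
    fragments = [fragment.split() for fragment in span.split(";")]
    start = int(fragments[0][0])
    end = int(fragments[-1][1])
    return Entity(ann_id, label, start, end, text)


if __name__ == "__main__":
    document = "Patient with myocardial infarction."
    entity = parse_ann_line("T1\tDISEASE 13 34\tmyocardial infarction")
    # The end offset is exclusive, so slicing recovers the mention.
    print(entity.label, document[entity.start:entity.end])
```

Checking that the slice of the .txt file equals the mention text is a cheap way to catch offset errors before training.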

Folder Structure

The training data is organized by language and entity type to facilitate system development.

MultiClinNER/
 ├── MultiClinNER-es/
 │    ├── MultiClinNER-es-train/
 │    │    ├── MultiClinNER-es-train-disease/
 │    │    │    ├── ann/
 │    │    │    │    ├── MultiClinNER-es-train-disease-0001.ann
 │    │    │    │    ├── MultiClinNER-es-train-disease-0002.ann
 │    │    │    │    ├── ...
 │    │    │    ├── txt/
 │    │    │    │    ├── MultiClinNER-es-train-disease-0001.txt
 │    │    │    │    ├── MultiClinNER-es-train-disease-0002.txt
 │    │    │    │    ├── ...
 │    │    ├── MultiClinNER-es-train-symptom/
 │    │    │    ├── ...
 │    │    ├── MultiClinNER-es-train-procedure/
 │    │    │    ├── ...
 ├── MultiClinNER-cz/
 │    ├── MultiClinNER-cz-train/
 │    │    ├── MultiClinNER-cz-train-disease/
 │    │    │    ├── ann/
 │    │    │    │    ├── ...
 │    │    │    ├── txt/
 │    │    │    │    ├── ...
 │    │    ├── MultiClinNER-cz-train-symptom/
 │    │    │    ├── ...
 │    │    ├── MultiClinNER-cz-train-procedure/
 │    │    │    ├── ...
 ├── MultiClinNER-{nl,en,it,ro,sv}/ (same as es and cz)

Each entity type is separated to simplify experiments focusing on a single concept category.
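
The layout above can be traversed with a short helper such as the following sketch. It assumes that each .ann file has a .txt file with the same base name in the sibling txt/ folder, as the tree suggests; the function name iter_document_pairs and the local root path are hypothetical.

```python
from pathlib import Path
from typing import Iterator, Tuple

ENTITY_TYPES = ("disease", "symptom", "procedure")


def iter_document_pairs(
    root: Path, lang: str, entity: str, split: str = "train"
) -> Iterator[Tuple[Path, Path]]:
    """Yield (txt_path, ann_path) pairs for one language and entity type."""
    prefix = f"MultiClinNER-{lang}"
    base = root / prefix / f"{prefix}-{split}" / f"{prefix}-{split}-{entity}"
    for ann_path in sorted((base / "ann").glob("*.ann")):
        txt_path = base / "txt" / (ann_path.stem + ".txt")
        # Skip annotations without a matching text file instead of failing.
        if txt_path.exists():
            yield txt_path, ann_path


if __name__ == "__main__":
    data_root = Path("MultiClinNER")  # assumed local download location
    for entity_type in ENTITY_TYPES:
        pairs = list(iter_document_pairs(data_root, "es", entity_type))
        print(entity_type, len(pairs))
```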

Participants may use the data to train monolingual or multilingual systems and may submit results for any subset of languages.