MultiClinCorpus Data

The corpora for all target languages are available online on Zenodo: MultiClinCorpus

The MultiClinCorpus subtask focuses on the automatic creation of comparable multilingual clinical corpora through cross-lingual, weakly supervised methods for training data generation. Unlike MultiClinNER, where systems extract entities directly from text, this task requires systems to generate annotated corpora in multiple target languages starting from a Spanish seed corpus. Participants are encouraged to explore a range of automatic approaches for transferring or inducing named entity labels across languages—particularly methods designed to support low-resource settings and to reduce the high cost and scarcity of manual annotation.

Task Setting

Participants are provided with:

A Spanish Gold Standard corpus with manually annotated entities.
The translated versions of the same texts in six target languages.
Training examples where entity correspondences across languages have been manually validated.

The goal is to automatically identify the exact character offsets of the corresponding entity mentions in the translated texts.

Data Sources

The underlying textual resources are the same as those used in MultiClinNER:

SpaCCC
CardioCCC
OnaCCC

These corpora include clinical case reports covering multiple medical specialties and were annotated using the same guidelines to ensure consistency across datasets.

The entity types included in the task are:

DISEASE
SYMPTOM
PROCEDURE

The seed language for annotation projection is Spanish, and projections are required for:

Czech
English
Dutch
Italian
Romanian
Swedish

Dataset Characteristics

The resulting dataset provides:

Parallel clinical case reports across seven languages
Expert-validated cross-lingual entity correspondences
Three clinically relevant entity categories
Texts from both translated and native clinical documents

This dataset enables the development and evaluation of methods such as:

Annotation projection
Word alignment
Multilingual representation learning
Generative approaches
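
As an illustration of annotation projection through word alignment, the sketch below maps a source character span onto the target text by following token-level alignment links. It is a minimal example under stated assumptions, not the organizers' method: the function names `tokenize` and `project_span` are hypothetical, the sentences are invented for the example, and the alignment pairs are written by hand here, whereas a real system would obtain them from a word aligner.

```python
import re


def tokenize(text):
    """Return (token, start, end) triples; punctuation becomes its own token."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def project_span(src_text, tgt_text, span, alignment):
    """Project a source character span onto the target text.

    alignment is a set of (source_token_index, target_token_index) pairs,
    assumed to come from an external word aligner.
    Returns (start, end) target character offsets, or None if nothing aligns.
    """
    start, end = span
    src_tokens = tokenize(src_text)
    tgt_tokens = tokenize(tgt_text)
    # Source tokens that overlap the annotated span.
    src_idx = {i for i, (_, s, e) in enumerate(src_tokens)
               if s < end and e > start}
    # Target tokens aligned to any of those source tokens.
    tgt_idx = sorted(j for i, j in alignment if i in src_idx)
    if not tgt_idx:
        return None
    # Smallest contiguous target span covering all aligned tokens.
    return tgt_tokens[tgt_idx[0]][1], tgt_tokens[tgt_idx[-1]][2]


# Invented example sentences (not taken from the corpus).
src = "Paciente con insuficiencia respiratoria grave."
tgt = "Patient with severe respiratory failure."
# Hand-written alignment: note the reordering of adjective and noun.
links = {(0, 0), (1, 1), (2, 4), (3, 3), (4, 2), (5, 5)}

projected = project_span(src, tgt, (13, 45), links)
print(projected, tgt[projected[0]:projected[1]])
```

Taking the smallest contiguous span is a simplifying choice: it copes with word reordering, but it can over-extend the span when an alignment link is wrong.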

Data Format

The data is distributed in BRAT standoff format:

.txt files contain the document text.
.ann files contain the validated entity spans.

Each entity annotation corresponds to a projected mention aligned with the Spanish Gold Standard.

Example annotation format:

(es)

T1 DISEASE 253 285 insuficiencia respiratoria grave

(en)

T1 DISEASE 238 264 severe respiratory failure
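
The annotation lines above can be read with a few lines of code. The following is a minimal sketch, assuming each line describes one continuous span (no discontinuous `start end;start end` fragments) and that fields are separated by tabs or spaces. The names `Entity`, `parse_ann_line` and `span_matches` are hypothetical helpers introduced for illustration. In BRAT standoff, the end offset points to the first character after the mention, so `end - start` equals the mention length.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str  # annotation identifier, e.g. "T1"
    label: str   # entity type, e.g. "DISEASE"
    start: int   # offset of the first character of the mention
    end: int     # offset of the first character after the mention
    text: str    # mention surface form


def parse_ann_line(line: str) -> Entity:
    """Parse one text-bound annotation line (continuous spans only)."""
    # Splitting on any whitespace handles tab- and space-separated lines;
    # maxsplit=4 keeps the spaces inside the mention text intact.
    ann_id, label, start, end, text = line.rstrip("\n").split(None, 4)
    return Entity(ann_id, label, int(start), int(end), text)


def span_matches(entity: Entity, document: str) -> bool:
    """Check that the offsets select exactly the mention text."""
    return document[entity.start:entity.end] == entity.text


es = parse_ann_line("T1 DISEASE 253 285 insuficiencia respiratoria grave")
en = parse_ann_line("T1 DISEASE 238 264 severe respiratory failure")
print(es.end - es.start == len(es.text), en.end - en.start == len(en.text))
```

Running `span_matches` against the matching .txt file is a cheap sanity check before submitting projected spans, since an off-by-one offset will not reproduce the mention text.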

Folder Structure

The training data is organized by language and entity type to facilitate system development.

MultiClinCorpus/
 ├── MultiClinCorpus-es/
 │    ├── MultiClinCorpus-es-train/
 │    │    ├── MultiClinCorpus-es-train-disease/
 │    │    │    ├── ann/
 │    │    │    │    ├── MultiClinCorpus-es-train-disease-0001.ann
 │    │    │    │    ├── MultiClinCorpus-es-train-disease-0002.ann
 │    │    │    │    ├── ...
 │    │    │    ├── txt/
 │    │    │    │    ├── MultiClinCorpus-es-train-disease-0001.txt
 │    │    │    │    ├── MultiClinCorpus-es-train-disease-0002.txt
 │    │    │    │    ├── ...
 │    │    ├── MultiClinCorpus-es-train-symptom/
 │    │    │    ├── ...
 │    │    ├── MultiClinCorpus-es-train-procedure/
 │    │    │    ├── ...
 ├── MultiClinCorpus-cz/
 │    ├── MultiClinCorpus-cz-train/
 │    │    ├── MultiClinCorpus-cz-train-disease/
 │    │    │    ├── ann/
 │    │    │    │    ├── ...
 │    │    │    ├── txt/
 │    │    │    │    ├── ...
 │    │    ├── MultiClinCorpus-cz-train-symptom/
 │    │    │    ├── ...
 │    │    ├── MultiClinCorpus-cz-train-procedure/
 │    │    │    ├── ...
 ├── MultiClinCorpus-{nl,en,it,ro,sv}/ (same layout as es and cz)

This structure enables participants to easily identify the seed annotations in Spanish and the corresponding annotations in the target languages.
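
The layout shown above can be traversed programmatically. The sketch below builds the directory for one language and entity type and pairs each .ann file with the .txt file of the same name. It assumes the downloaded archive follows the tree exactly as displayed; `subset_dir` and `iter_documents` are hypothetical helper names, and the `split` parameter defaults to "train" because that is the only split shown.

```python
from pathlib import Path

# Entity-type folder suffixes as they appear in the tree above.
ENTITY_TYPES = ("disease", "symptom", "procedure")


def subset_dir(root: Path, lang: str, entity: str, split: str = "train") -> Path:
    """Directory holding the ann/ and txt/ folders for one subset."""
    name = f"MultiClinCorpus-{lang}"
    return root / name / f"{name}-{split}" / f"{name}-{split}-{entity}"


def iter_documents(root: Path, lang: str, entity: str, split: str = "train"):
    """Yield (txt_path, ann_path) pairs for one language and entity type."""
    base = subset_dir(root, lang, entity, split)
    for ann_path in sorted((base / "ann").glob("*.ann")):
        txt_path = base / "txt" / (ann_path.stem + ".txt")
        # Skip annotation files that have no matching text file.
        if txt_path.exists():
            yield txt_path, ann_path
```

For example, `iter_documents(Path("MultiClinCorpus"), "es", "disease")` would yield the Spanish seed documents for the disease subset, and the same call with "cz" would yield the Czech ones.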

Participants may submit results for any target language, and systems may use any modeling approach capable of identifying the projected entity spans.