The corpora for all target languages are available online at Zenodo: MultiClinCorpus
MultiClinCorpus Data
The MultiClinCorpus subtask focuses on the automatic creation of comparable multilingual clinical corpora through cross-lingual, weakly supervised methods for training data generation. Unlike MultiClinNER, where systems extract entities directly from text, this task requires systems to generate annotated corpora in multiple target languages starting from a Spanish seed corpus. Participants are encouraged to explore a range of automatic approaches for transferring or inducing named entity labels across languages, particularly methods designed to support low-resource settings and to reduce the high cost and scarcity of manual annotation.
Task Setting
Participants are provided with:
- A Spanish Gold Standard corpus with manually annotated entities.
- The translated versions of the same texts in six target languages.
- Training examples where entity correspondences across languages have been manually validated.
The goal is to automatically identify the exact character offsets of the corresponding entity mentions in the translated texts.
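To make the offset requirement concrete, the following is a minimal sketch of a naive baseline, not the organizers' method. It assumes a candidate target-language string for the mention is already available (for example from a translation or alignment step, which is not shown) and picks the exact, case-sensitive occurrence whose relative position in the translated text is closest to the relative position of the Spanish mention. The function name `locate_mention` and its parameters are hypothetical.

```python
from typing import Optional, Tuple


def locate_mention(
    target_text: str, candidate: str, source_rel_pos: float
) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of `candidate` in `target_text`.

    `source_rel_pos` is the Spanish mention's start offset divided by the
    length of the Spanish text (a value between 0 and 1). Among all exact
    occurrences, the one closest to that relative position is chosen.
    Returns None if the candidate string does not occur at all.
    """
    if not candidate or not target_text:
        return None

    best_start = None
    best_dist = None
    pos = target_text.find(candidate)
    while pos != -1:
        dist = abs(pos / len(target_text) - source_rel_pos)
        if best_dist is None or dist < best_dist:
            best_start, best_dist = pos, dist
        # Continue searching one character further to catch later occurrences.
        pos = target_text.find(candidate, pos + 1)

    if best_start is None:
        return None
    # End offset is exclusive, as in the BRAT examples in this document.
    return best_start, best_start + len(candidate)


if __name__ == "__main__":
    text = "fever at admission; later fever again"
    print(locate_mention(text, "fever", 0.8))
```

Exact string matching fails whenever the translation inflects or reorders the mention, which is why word alignment or generative approaches are typically needed in practice.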
Data Sources
The underlying textual resources are the same as those used in MultiClinNER:
- SpaCCC
- CardioCCC
- OnaCCC
These corpora include clinical case reports covering multiple medical specialties and were annotated using the same guidelines to ensure consistency across datasets.
The entity types included in the task are:
- DISEASE
- SYMPTOM
- PROCEDURE
The seed language for annotation projection is Spanish, and projections are required for:
- Czech
- English
- Dutch
- Italian
- Romanian
- Swedish
Dataset Characteristics
The resulting dataset provides:
- Parallel clinical case reports across seven languages
- Expert-validated cross-lingual entity correspondences
- Three clinically relevant entity categories
- Texts from both translated and native clinical documents
This dataset enables the development and evaluation of methods such as:
- Annotation projection
- Word alignment
- Multilingual representation learning
- Generative approaches
Data Format
The data is distributed in BRAT standoff format:
- .txt files contain the document text.
- .ann files contain the validated entity spans.
Each entity annotation corresponds to a projected mention aligned with the Spanish Gold Standard.
Example annotation format:
(es)
T1 DISEASE 253 285 insuficiencia respiratoria grave
(en)
T1 DISEASE 238 264 severe respiratory failure
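The annotation lines above can be read with a few lines of code. The sketch below is illustrative only: it assumes contiguous spans (a single start and end offset per entity) and end offsets that are exclusive, as in the two examples. In BRAT standoff files the identifier, the "type start end" group, and the mention text are separated by tabs; splitting on any whitespace also accepts the space-separated rendering shown here. The names `Entity`, `parse_entity_line`, and `span_matches` are hypothetical helpers.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_entity_line(line: str) -> Entity:
    """Parse one text-bound annotation line with a single contiguous span."""
    # Split into at most five fields so that spaces inside the mention text
    # are preserved in the last field.
    ann_id, label, start, end, text = line.strip().split(None, 4)
    return Entity(ann_id, label, int(start), int(end), text)


def span_matches(doc_text: str, ent: Entity) -> bool:
    """Check that the offsets select exactly the annotated mention text."""
    return doc_text[ent.start:ent.end] == ent.text


if __name__ == "__main__":
    ent = parse_entity_line("T1\tDISEASE 238 264\tsevere respiratory failure")
    # Synthetic document: 238 filler characters followed by the mention.
    doc = "x" * 238 + "severe respiratory failure."
    print(ent, span_matches(doc, ent))
```

Running `span_matches` over every generated annotation is a cheap sanity check before submission, since an off-by-one offset makes the span text and the document text disagree.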
Folder Structure
The training data is organized by language and entity type to facilitate system development.
MultiClinCorpus/
├── MultiClinCorpus-es/
│ ├── MultiClinCorpus-es-train/
│ │ ├── MultiClinCorpus-es-train-disease/
│ │ │ ├── ann/
│ │ │ │ ├── MultiClinCorpus-es-train-disease-0001.ann
│ │ │ │ ├── MultiClinCorpus-es-train-disease-0002.ann
│ │ │ │ ├── ...
│ │ │ ├── txt/
│ │ │ │ ├── MultiClinCorpus-es-train-disease-0001.txt
│ │ │ │ ├── MultiClinCorpus-es-train-disease-0002.txt
│ │ │ │ ├── ...
│ │ ├── MultiClinCorpus-es-train-symptom/
│ │ │ ├── ...
│ │ ├── MultiClinCorpus-es-train-procedure/
│ │ │ ├── ...
├── MultiClinCorpus-cz/
│ ├── MultiClinCorpus-cz-train/
│ │ ├── MultiClinCorpus-cz-train-disease/
│ │ │ ├── ann/
│ │ │ │ ├── ...
│ │ │ ├── txt/
│ │ │ │ ├── ...
│ │ ├── MultiClinCorpus-cz-train-symptom/
│ │ │ ├── ...
│ │ ├── MultiClinCorpus-cz-train-procedure/
│ │ │ ├── ...
├── MultiClinCorpus-{nl,en,it,ro,sv}/ (same as es and cz)
This structure enables participants to easily identify the seed annotations in Spanish and the corresponding annotations in the target languages.
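As a rough sketch of how the layout can be used, the helper below builds the .txt and .ann paths for one document. It assumes that file names in every language folder follow the same pattern as the Spanish files listed in the tree, with only the language code substituted; the tree elides the target-language file names, so this should be checked against the downloaded data. The function name `doc_paths` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def doc_paths(
    root: str, lang: str, entity_type: str, doc_id: int, split: str = "train"
) -> Tuple[Path, Path]:
    """Return (txt_path, ann_path) for one document.

    Assumption: all languages share the naming pattern shown for Spanish,
    e.g. MultiClinCorpus-es-train-disease-0001.ann.
    """
    stem = f"MultiClinCorpus-{lang}-{split}-{entity_type}"
    base = (
        Path(root)
        / f"MultiClinCorpus-{lang}"
        / f"MultiClinCorpus-{lang}-{split}"
        / stem
    )
    name = f"{stem}-{doc_id:04d}"
    return base / "txt" / f"{name}.txt", base / "ann" / f"{name}.ann"


if __name__ == "__main__":
    # Pair the Spanish seed document with its Czech counterpart.
    es_txt, es_ann = doc_paths("MultiClinCorpus", "es", "disease", 1)
    cz_txt, cz_ann = doc_paths("MultiClinCorpus", "cz", "disease", 1)
    print(es_ann)
    print(cz_ann)
```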
Participants may submit results for any target language, and systems may use any modeling approach capable of identifying the projected entity spans.