Data – Multilingual clinical case summarization

Training Dataset

The training set includes more than 25K full case-summary pairs for each of the 15 task languages and is now available on Zenodo. Participants can download the data to explore the task structure and start developing their text summarization models.

MultiClinSum-2 Sample set: https://zenodo.org/records/18663291

Dataset Overview

MultiClinSum-2 provides an extensive multilingual dataset for automatic summarization of clinical case reports across 15 languages. The dataset combines large-scale translated case reports with carefully curated native-language clinical cases, offering participants high-quality resources for developing and evaluating summarization systems in diverse linguistic clinical contexts.

Language Coverage

MultiClinSum-2 encompasses 15 languages, expanding significantly from the first edition:

English (1st edition) – 26943 pairs
French (1st edition) – 26455 pairs
Spanish (1st edition) – 26743 pairs
Portuguese (1st edition) – 26593 pairs
Swedish – 26594 pairs
Italian – 26553 pairs
Russian – 26086 pairs
Catalan – 26699 pairs
Norwegian – 26609 pairs
Danish – 26652 pairs
Romanian – 26579 pairs
German – 26455 pairs
Greek – 25749 pairs
Dutch – 26587 pairs
Czech – 26488 pairs
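
As a quick sanity check on the figures above, the per-language counts can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary is copied by hand from the list, and the variable names are our own, not part of any official tooling.

```python
# Pair counts copied from the language list above (illustrative only).
PAIRS_PER_LANGUAGE = {
    "English": 26943, "French": 26455, "Spanish": 26743,
    "Portuguese": 26593, "Swedish": 26594, "Italian": 26553,
    "Russian": 26086, "Catalan": 26699, "Norwegian": 26609,
    "Danish": 26652, "Romanian": 26579, "German": 26455,
    "Greek": 25749, "Dutch": 26587, "Czech": 26488,
}

total_pairs = sum(PAIRS_PER_LANGUAGE.values())
smallest = min(PAIRS_PER_LANGUAGE, key=PAIRS_PER_LANGUAGE.get)
largest = max(PAIRS_PER_LANGUAGE, key=PAIRS_PER_LANGUAGE.get)

print(f"Languages: {len(PAIRS_PER_LANGUAGE)}")
print(f"Total pairs: {total_pairs}")
print(f"Smallest set: {smallest} ({PAIRS_PER_LANGUAGE[smallest]})")
print(f"Largest set: {largest} ({PAIRS_PER_LANGUAGE[largest]})")
```

Every language has more than 25K pairs, with Greek the smallest set and English the largest.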

Data Sources

The MultiClinSum-2 dataset was built from two complementary biomedical data sources:

1. Native Clinical Case Reports

A carefully curated collection of approximately 1k native full case-summary pairs, manually selected from PubMed. These case reports were originally published in one of four languages (English, Spanish, French, or Portuguese) and are accompanied by summaries written manually by their authors, ensuring robust quality and clinical accuracy. To maintain consistency across the entire language set, these native cases have been translated into all remaining supported languages of the task.

2. PMC-Patients Subset (Large-Scale Collection)

This large-scale data source consists of approximately 41k full case-summary pairs derived from the PMC-Patients subset of PubMed Central, where the full texts of the clinical case reports serve as the source documents for the summarization task. The corresponding summaries were extracted from specific header types of the original abstracts, such as “Summary of the Case” or “Summary”. These cases were originally available in English and have been translated into all 14 additional task languages using machine translation.
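
To illustrate the idea of header-based summary extraction, here is a minimal sketch. It assumes a structured abstract is already available as a list of (header, text) pairs. The function name `extract_summary`, the input format, and the matching rule are hypothetical, and only the two header labels named above are included. It does not reproduce the organizers' actual pipeline.

```python
# Header labels named in the description; the full list used to build
# the dataset is not given there, so this tuple is only an assumption.
SUMMARY_HEADERS = ("Summary of the Case", "Summary")


def extract_summary(abstract_sections):
    """Return the text of the first section whose header matches a
    known summary label (case-insensitive), or None if none matches.

    abstract_sections: list of (header, text) tuples (hypothetical format).
    """
    wanted = {label.lower() for label in SUMMARY_HEADERS}
    for header, text in abstract_sections:
        # Normalize the header: trim whitespace and a trailing colon.
        normalized = header.strip().rstrip(":").strip().lower()
        if normalized in wanted:
            return text.strip()
    return None


example = [
    ("Background", "Rare presentations can delay diagnosis."),
    ("Summary of the Case:", " A patient presented with chest pain. "),
]
print(extract_summary(example))
```

Abstracts without a matching header return `None`, so such cases could simply be skipped when building pairs.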