Data – Multilingual clinical summarization

Test Dataset

The test set includes 1,000 full clinical case reports per language (15,000 total), derived from the same biomedical sources as the training data: the PMC-Patients subset of PubMed Central and manually curated case reports from PubMed. For each case, participants must generate a concise summary that preserves the essential clinical information.

Languages included: English, Spanish, French, Portuguese, Italian, Russian, Catalan, Norwegian, Danish, Romanian, German, Greek, Dutch, Czech, and Swedish.

⚠️ Important: Reference summaries are NOT included in this release and will be made available after the submission deadline.

Submission deadline: May 8, 2026, 15:00 CET (AoE)

Download test set: https://zenodo.org/records/19824154

For submission guidelines and evaluation details, visit: https://temu.bsc.es/multiclinsum2/evaluation-submission/

Related Datasets

Training dataset (https://zenodo.org/records/18887797)
Includes ~26,000 full case-summary pairs for each of the 15 task languages.
Sample dataset (https://zenodo.org/records/18663291)
Includes 50 full case-summary pairs for each of the 15 task languages.

Dataset Overview

MultiClinSum-2 provides an extensive multilingual dataset for automatic summarization of clinical case reports across 15 languages. The dataset combines large-scale translated case reports with carefully curated native-language clinical cases, offering participants high-quality resources for developing and evaluating summarization systems in diverse linguistic clinical contexts.

Language Coverage

MultiClinSum-2 encompasses 15 languages, expanding significantly from the first edition:

Original Languages (1st edition):

English: 26,943 pairs
Spanish: 26,743 pairs
French: 26,455 pairs
Portuguese: 26,593 pairs

New Languages (2nd edition):

Italian: 26,553 pairs
Russian: 26,086 pairs
Catalan: 26,699 pairs
Norwegian: 26,609 pairs
Danish: 26,652 pairs
Romanian: 26,579 pairs
German: 26,455 pairs
Greek: 25,749 pairs
Dutch: 26,587 pairs
Czech: 26,488 pairs
Swedish: 26,594 pairs
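
As a quick sanity check on the figures above, the per-language counts can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary simply restates the numbers listed in this section, and the names PAIRS_PER_LANGUAGE and summarize_counts are hypothetical helpers, not part of any official tooling.

```python
# Pair counts per language, copied from the Language Coverage list above.
PAIRS_PER_LANGUAGE = {
    "English": 26_943,
    "Spanish": 26_743,
    "French": 26_455,
    "Portuguese": 26_593,
    "Italian": 26_553,
    "Russian": 26_086,
    "Catalan": 26_699,
    "Norwegian": 26_609,
    "Danish": 26_652,
    "Romanian": 26_579,
    "German": 26_455,
    "Greek": 25_749,
    "Dutch": 26_587,
    "Czech": 26_488,
    "Swedish": 26_594,
}


def summarize_counts(counts):
    """Return the number of languages, total and mean pair counts,
    and the languages with the fewest and the most pairs."""
    total = sum(counts.values())
    return {
        "languages": len(counts),
        "total": total,
        "mean": total / len(counts),
        "smallest": min(counts, key=counts.get),
        "largest": max(counts, key=counts.get),
    }


stats = summarize_counts(PAIRS_PER_LANGUAGE)
print(stats)
```

With the counts listed here, the total comes to 397,785 pairs across the 15 languages, with Greek the smallest subset and English the largest.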

Data Sources

The MultiClinSum-2 dataset was built from two complementary biomedical data sources:

Native Clinical Case Reports (High-Quality Component)

A carefully curated collection of approximately 1,000 native full case-summary pairs, manually selected from PubMed. These case reports were originally published in four languages (English, Spanish, French, and Portuguese) and are accompanied by summaries written manually by their authors, ensuring robust quality and clinical accuracy.

To maintain consistency across the entire language set, each native case has been translated into the remaining task languages, so that it is available in all 15 supported languages.

PMC-Patients Subset (Large-Scale Collection)

This large-scale data source consists of approximately 41,000 full case-summary pairs derived from the PMC-Patients subset of PubMed Central, where the full texts of the clinical case reports serve as the source documents for the summarization task. The corresponding summaries were extracted from specific header types of the original abstracts, such as “Summary of the Case” or “Summary”. These cases were originally available in English and have been translated into the 14 additional task languages using machine translation.
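
The header-based extraction described above can be illustrated with a minimal Python sketch. The actual extraction pipeline is not described here, so everything in this example is an assumption: the abstract is represented as a hypothetical dictionary mapping section headers to text, the header list contains only the two examples named above, and extract_summary is an illustrative helper name.

```python
# Header labels treated as summaries; only the two examples named in the
# text are included. The real list of header types may be longer (assumption).
SUMMARY_HEADERS = ("summary of the case", "summary")


def extract_summary(abstract_sections):
    """Return the text under the first matching summary header, or None.

    abstract_sections is a hypothetical representation of a structured
    abstract: a dict mapping section headers to their text.
    """
    # Normalize headers so that case, surrounding spaces, and a trailing
    # colon do not prevent a match.
    normalized = {
        header.strip().lower().rstrip(":"): text
        for header, text in abstract_sections.items()
    }
    for header in SUMMARY_HEADERS:
        text = normalized.get(header)
        if text:
            return text.strip()
    return None


# Placeholder content only; not taken from any real case report.
example = {
    "Background": "Placeholder background text.",
    "Summary of the Case:": "  Placeholder summary text.  ",
}
print(extract_summary(example))
```

Abstracts without any matching header return None in this sketch, which would correspond to cases that cannot be paired with a summary by this method.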