Training Dataset

The training set includes more than 25K full case-summary pairs for each of the 15 task languages and is now available on Zenodo. Participants can download the data to explore the task structure and start developing their text summarization models.

MultiClinSum-2 Sample set: https://zenodo.org/records/18663291

Dataset Overview

MultiClinSum-2 provides an extensive multilingual dataset for automatic summarization of clinical case reports across 15 languages. The dataset combines large-scale translated case reports with carefully curated native-language clinical cases, offering participants high-quality resources for developing and evaluating summarization systems in diverse linguistic clinical contexts.

Language Coverage

MultiClinSum-2 encompasses 15 languages, expanding significantly from the first edition:

  • English (1st edition) – 26943 pairs
  • French (1st edition) – 26455 pairs
  • Spanish (1st edition) – 26743 pairs
  • Portuguese (1st edition) – 26593 pairs
  • Swedish – 26594 pairs
  • Italian – 26553 pairs
  • Russian – 26086 pairs
  • Catalan – 26699 pairs
  • Norwegian – 26609 pairs
  • Danish – 26652 pairs
  • Romanian – 26579 pairs
  • German – 26455 pairs
  • Greek – 25749 pairs
  • Dutch – 26587 pairs
  • Czech – 26488 pairs
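
The per-language counts above can be tallied with a short script, which is handy for sanity-checking a download against the published figures. The following is only a sketch: the counts are copied from the list above, while the names PAIR_COUNTS and summarize_counts are illustrative and not part of any official task tooling.

```python
# Pair counts per language, copied from the Language Coverage list.
PAIR_COUNTS = {
    "English": 26943,
    "French": 26455,
    "Spanish": 26743,
    "Portuguese": 26593,
    "Swedish": 26594,
    "Italian": 26553,
    "Russian": 26086,
    "Catalan": 26699,
    "Norwegian": 26609,
    "Danish": 26652,
    "Romanian": 26579,
    "German": 26455,
    "Greek": 25749,
    "Dutch": 26587,
    "Czech": 26488,
}


def summarize_counts(counts):
    """Return (total pairs, language with fewest pairs, language with most pairs)."""
    total = sum(counts.values())
    smallest = min(counts, key=counts.get)
    largest = max(counts, key=counts.get)
    return total, smallest, largest


if __name__ == "__main__":
    total, smallest, largest = summarize_counts(PAIR_COUNTS)
    print(f"{len(PAIR_COUNTS)} languages, {total} pairs in total")
    print(f"fewest pairs: {smallest} ({PAIR_COUNTS[smallest]})")
    print(f"most pairs: {largest} ({PAIR_COUNTS[largest]})")
```

Across the 15 languages this adds up to 397785 pairs, with Greek the smallest set and English the largest.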

Data Sources

The MultiClinSum-2 dataset was built from two complementary biomedical data sources:

1. Native Clinical Case Reports

A carefully curated collection of approximately 1K native full case-summary pairs, manually selected from PubMed. These case reports were originally published in one of four languages (English, Spanish, French, or Portuguese) and are accompanied by summaries written by their authors, ensuring robust quality and clinical accuracy. To maintain consistency across the entire language set, these native cases have been translated into the remaining supported languages of the task.

2. PMC-Patients Subset (Large-Scale Collection)

This large-scale data source consists of approximately 41K full case-summary pairs derived from the PMC-Patients subset of PubMed Central, where the full texts of the clinical case reports serve as the source documents for the summarization task. The corresponding summaries were extracted from specific header types of the original abstracts, such as “Summary of the Case” or “Summary”. These cases were originally available in English and have been translated into all 14 additional task languages using machine translation.
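
To make the header-based extraction more concrete, here is a minimal sketch of how a summary section might be pulled out of a structured abstract. It is not the organizers' actual pipeline, which is not described here. It assumes that each section starts on its own line as "Header: text"; the header list contains only the two examples named above, and the sample abstract and all function names are invented for illustration.

```python
import re

# Only the two header types named in the text; the real list may be longer.
SUMMARY_HEADERS = ("Summary of the Case", "Summary")

# Assumption: a section header is a short run of letters at the start of a
# line, followed by a colon. Body lines that happen to look like this
# (e.g. "Note: ...") would be mistaken for headers in this simple sketch.
HEADER_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z /&-]*?)\s*:\s*(.*)$")


def split_sections(abstract):
    """Split a structured abstract into {lowercased header: section text}."""
    sections = {}
    current = None
    for line in abstract.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            current = match.group(1).strip().lower()
            sections[current] = [match.group(2)]
        elif current is not None:
            # Continuation line of the section opened most recently.
            sections[current].append(line)
    return {
        header: " ".join(part.strip() for part in parts if part.strip())
        for header, parts in sections.items()
    }


def extract_summary(abstract, headers=SUMMARY_HEADERS):
    """Return the text under the first matching summary header, or None."""
    sections = split_sections(abstract)
    for header in headers:
        text = sections.get(header.lower())
        if text:
            return text
    return None


# Invented example abstract, for illustration only.
SAMPLE_ABSTRACT = """Background: Rare tumours are hard to diagnose.
Summary of the Case: A 54-year-old woman presented with chest pain.
She recovered after surgery.
Conclusion: Early imaging helped."""

if __name__ == "__main__":
    print(extract_summary(SAMPLE_ABSTRACT))
```

Abstracts without a recognized header return None, so under this assumption such cases would simply be skipped rather than paired with an unrelated section.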