Evaluation Metrics

MultiClinSum-2 employs a comprehensive and fair evaluation framework combining automatic metrics with advanced LLM-based assessment to measure the quality of generated clinical summaries across all language sub-tracks. System-generated summaries will be evaluated against gold standard reference summaries using the following metrics:

  • ROUGE: This family of metrics measures n-gram overlap between the generated summary and the reference summary; its ROUGE-2 variant counts shared bigrams (two-word sequences), providing insight into how well the system captures key phrases and clinical terminology from the original text.
  • BERTScore: Leveraging contextual embeddings, BERTScore evaluates semantic similarity between generated and reference summaries beyond simple lexical overlap. This metric is particularly valuable for clinical summarization, where paraphrasing and varied medical terminology may convey equivalent clinical meaning.
  • LLM-as-Judge: To complement automatic metrics and capture detailed aspects of summary quality, we will employ large language models as evaluators. This approach enables assessment of dimensions that traditional metrics may not fully capture, such as clinical coherence, factual accuracy, completeness of key clinical information, and overall usefulness of the summary for healthcare professionals.
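To make the bigram-overlap idea concrete, here is a minimal, unofficial sketch of a ROUGE-2 F1 score using only the standard library. It is an illustration of the principle only; the actual evaluation will use a standard ROUGE implementation with its own tokenization and stemming, so scores from this sketch will not match official ones exactly.

```python
from collections import Counter

def rouge_2(candidate: str, reference: str) -> float:
    """Simplified ROUGE-2 F1: bigram overlap between candidate and reference.

    Illustrative only -- official evaluation uses a standard ROUGE package
    with proper tokenization, not naive whitespace splitting.
    """
    def bigrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_2("the patient was discharged home", "the patient was discharged to home")` shares 3 of the candidate's 4 bigrams and 3 of the reference's 5, giving an F1 of 2/3.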

Submission

Submission Portal

All submissions must be made through the BioASQ Participant Area:
https://participants-area.bioasq.org/accounts/login/?next=/Tasks/multiclinsum/

Participation Flexibility

Each sub-track of the MultiClinSum-2 task is independent: submissions can be made separately for any of the four sub-tracks (languages). It is NOT mandatory to generate predictions for all languages, so teams may participate in a single language only. However, for a given language, it is mandatory to generate predictions for all cases in the test set, not just a subset.

When submitting your predictions (runs), make sure the corresponding target language is correctly specified. To do so, follow the predefined naming convention for your submissions.

⚠️ Important: If you choose to participate in a language, you must generate summaries for all full cases in that language’s test set (1k cases per language). Partial submissions for a language will not be accepted.

Allowed runs per sub-track

For each sub-track (i.e., language), a total of 5 versions or runs is allowed. For instance, for the MultiClinSum-2-en subtask, up to 5 different sets of predictions for the entire test set can be submitted via the submission page. Runs are evaluated independently, and only the best one is selected for the leaderboard.

You may also submit fewer runs: a single run, or 2, 3, or 4 runs in total. Submitting 5 runs is not required; the limit of 5 exists so that participating teams can try out different approaches, methods, or settings.

Submission Format

To submit the predictions for a given sub-track and run, place the generated summary files into a single directory, following the naming conventions specified below:

Naming Conventions

  • ZIP file name: {team_name}.zip
    • Example: BSC_NLP4BIA.zip
    • Use only alphanumeric characters and underscores (no spaces or special characters)
  • Top-level folder: {team_name}/
    • Must match the ZIP file name
  • Language folders: Use two-letter language codes
    • Supported: en, es, fr, pt, it, ru, ca, no, da, ro, de, el, nl, cs, sv
    • Only include folders for languages you’re submitting
  • Run folders: run1/, run2/, run3/, run4/, run5/
    • Number your runs sequentially starting from 1
    • Maximum 5 runs per language
  • Summary files: multiclinsum2_test_{id}_{lang}_sum.txt
    • Must match exactly the corresponding test case ID and language from the released test set
    • Example: For test full-case multiclinsum2_test_42_ca.txt, submit multiclinsum2_test_42_ca_sum.txt
    • Plain text format (no formatting, no special tags)
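The conventions above can be combined into a single path template. The following sketch builds the expected location of one summary file, assuming the nesting {team_name}/{lang}/runN/ implied by the list above; the function name `summary_path` is our own and not part of the task materials.

```python
from pathlib import Path

# Two-letter codes listed as supported in the naming conventions
SUPPORTED_LANGS = {"en", "es", "fr", "pt", "it", "ru", "ca", "no",
                   "da", "ro", "de", "el", "nl", "cs", "sv"}

def summary_path(team: str, lang: str, run: int, case_id: int) -> Path:
    """Expected path of one generated summary inside the submission folder.

    Hypothetical helper; assumes the folder nesting team/lang/runN/
    described in the naming conventions above.
    """
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang}")
    if not 1 <= run <= 5:
        raise ValueError("runs are numbered 1 to 5")
    return Path(team) / lang / f"run{run}" / f"multiclinsum2_test_{case_id}_{lang}_sum.txt"
```

For instance, `summary_path("BSC_NLP4BIA", "ca", 1, 42)` yields `BSC_NLP4BIA/ca/run1/multiclinsum2_test_42_ca_sum.txt`, matching the example given above.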

Important:

  • Submit your predictions as a single .zip file with the proposed structure
  • Only include language folders for languages you are submitting
  • Each language folder must contain at least one run folder (run1)
  • Each run folder must contain summaries for ALL test cases in that language
  • Follow the exact naming conventions shown above