Evaluation
MultiClinSum-2 employs an evaluation framework that combines automatic metrics with LLM-based assessment to measure the quality of generated clinical summaries across all language sub-tracks.
Automatic Metrics
System-generated summaries will be evaluated against gold standard reference summaries using the following metrics:
ROUGE-2: This metric measures the overlap of bigrams (two-word sequences) between the generated summary and the reference summary, providing insight into how well the system captures key phrases and clinical terminology from the original text.
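ROUGE-2 reduces to counting overlapping bigrams between the two texts. The sketch below is a minimal, simplified illustration of that computation (plain whitespace tokenization, no stemming), not the official scoring script:

```python
from collections import Counter

def rouge_2_f1(candidate: str, reference: str) -> float:
    """ROUGE-2 F1 from clipped bigram overlap (whitespace tokenization)."""
    def bigrams(text):
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    if not cand or not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # clipped matching bigram counts
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy pair sharing the bigrams "patient presented", "presented with", "with fever":
ref = "the patient presented with fever and cough"
cand = "patient presented with fever"
print(round(rouge_2_f1(cand, ref), 3))  # → 0.667
```

Production evaluations typically use an established implementation (e.g. the `rouge-score` package), which additionally handles stemming and sentence splitting.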
BERTScore: Leveraging contextual embeddings, BERTScore evaluates semantic similarity between generated and reference summaries beyond simple lexical overlap. This metric is particularly valuable for clinical summarization, where paraphrasing and varied medical terminology may convey equivalent clinical meaning.
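At its core, BERTScore greedily matches each token in one text to its most similar token in the other via cosine similarity of embeddings. The sketch below illustrates only that matching step, using tiny hand-crafted static vectors in place of real contextual BERT embeddings; the vocabulary and vectors are invented for illustration:

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy-matching F1 over token embeddings (the core of BERTScore)."""
    # Normalize rows so dot products are cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (num_cand_tokens, num_ref_tokens)
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy static embeddings stand in for contextual vectors; "fever" and
# "pyrexia" are deliberately close, so the paraphrase still scores high.
vocab = {
    "patient": [1.0, 0.0, 0.0],
    "has":     [0.0, 1.0, 0.0],
    "fever":   [0.0, 0.1, 1.0],
    "pyrexia": [0.0, 0.2, 1.0],
}
embed = lambda tokens: np.array([vocab[t] for t in tokens])
f1 = bertscore_f1(embed(["patient", "has", "pyrexia"]),
                  embed(["patient", "has", "fever"]))
print(f1)  # close to 1.0 despite zero bigram overlap on the key term
```

This is why BERTScore rewards clinically equivalent paraphrases ("pyrexia" for "fever") that ROUGE-2 would penalize; the actual metric is computed with the `bert-score` package and a pretrained model.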
LLM-as-a-Judge Evaluation
To complement automatic metrics and capture detailed aspects of summary quality, we will employ large language models as evaluators. This approach enables assessment of dimensions that traditional metrics may not fully capture, such as clinical coherence, factual accuracy, completeness of key clinical information, and overall usefulness of the summary for healthcare professionals.
Submissions
Each sub-track of the MultiClinSum-2 task is independent: submissions can be made separately for any of the four sub-tracks (languages). It is NOT mandatory to submit predictions for all languages; teams may also participate in a single language only. Note, however, that for a given language it is mandatory to generate predictions for all cases in the test set, not just a subset.
When submitting your predictions (runs), make sure the corresponding target language is correctly specified. To do so, follow the predefined naming convention for your submissions.
Allowed runs per sub-track
For each sub-track (i.e. language), a total of 5 versions or runs are allowed. For instance, for the MultiClinSum-2-en sub-track, up to 5 different sets of predictions for the entire test set can be uploaded to the submission page. They will be evaluated independently, and only the best-scoring run will appear on the leaderboard.
You can also send fewer runs: a single run, or 2, 3, 4, or 5 runs in total. Submitting 5 runs is not required; we allow up to 5 so that participating teams can try out different approaches, methods, or settings.
To submit the predictions for a given sub-track and run, place the generated summary files in a single directory (following the naming conventions for generated summaries specified above). So that we can identify each sub-track/run combination correctly, add the language and the run number to the directory name.
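Since every test case must have a prediction, a quick completeness check before uploading can save a rejected run. The sketch below assumes one `<case_id>.txt` file per summary purely for illustration; use the official filename convention from the task page, not these names:

```python
from pathlib import Path

def missing_cases(run_dir: str, case_ids: set[str]) -> list[str]:
    """Return test-case ids that have no summary file in run_dir.

    Assumes (hypothetically) one '<case_id>.txt' file per summary;
    substitute the official naming convention for your sub-track.
    """
    present = {p.stem for p in Path(run_dir).glob("*.txt")}
    return sorted(case_ids - present)

# Example: a directory named e.g. with the language and run number
# (say, 'multiclinsum2_en_run1' — an assumed name, check the task page)
# would be validated against the full set of test-case ids:
# missing_cases("multiclinsum2_en_run1", all_test_case_ids) == []
```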