Results Metrics – Multilingual clinical summarization

📊 Metrics Explanation

This section reports leaderboard results per language sub-track. Each row corresponds to one team submission (a single run). Scores are macro-averaged over all test pairs for the given language, and rows are sorted by BERTScore.
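As a rough illustration of the aggregation described above, the sketch below macro-averages per-pair scores for one language and sorts the resulting rows by BERTScore. The team names, language codes, score values, and the `leaderboard` helper are hypothetical, and the descending sort order is an assumption based on "higher is better"; this is not the organisers' actual scoring script.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-pair results: (team, language, BERTScore F1 for one test pair).
pair_scores = [
    ("team_a", "es", 0.80), ("team_a", "es", 0.90),
    ("team_b", "es", 0.70), ("team_b", "es", 0.75),
    ("team_a", "fr", 0.60), ("team_b", "fr", 0.85),
]

def leaderboard(pair_scores, language):
    """Macro-average each team's per-pair scores for one language, best first."""
    by_team = defaultdict(list)
    for team, lang, score in pair_scores:
        if lang == language:
            by_team[team].append(score)
    # Macro-average: unweighted mean over the test pairs of that language.
    rows = [(team, mean(scores)) for team, scores in by_team.items()]
    # Assumption: rows are sorted in descending order (higher is better).
    return sorted(rows, key=lambda row: row[1], reverse=True)

for team, score in leaderboard(pair_scores, "es"):
    print(f"{team}\t{score:.3f}")
```

Each language sub-track gets its own table, so the same team can appear at different positions in different languages.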

Metrics

Classic metrics:

ROUGE-1, ROUGE-2 — unigram and bigram overlap between the generated and reference summaries (F1).
ROUGE-Lsum — longest common subsequence overlap, computed sentence by sentence (F1).
BERTScore — semantic similarity using multilingual DeBERTa embeddings (F1).
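To make the overlap-based metrics above more concrete, here is a simplified sketch of ROUGE-N and single-sequence ROUGE-L as F1 scores. It assumes lowercased whitespace tokenization with no stemming, and it omits the sentence-level aggregation that ROUGE-Lsum performs, so its values will not necessarily match the leaderboard. The function names are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(matches, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0:
        return 0.0
    precision = matches / cand_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap (assumed whitespace tokens)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the minimum
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L on one sequence pair (not the full ROUGE-Lsum)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))

generated = "the patient recovered fully"
reference = "the patient recovered after surgery"
print(round(rouge_n_f1(generated, reference, 1), 3))
print(round(rouge_n_f1(generated, reference, 2), 3))
print(round(rouge_l_f1(generated, reference), 3))
```

In this toy pair, three of the four generated unigrams appear in the five-token reference, so precision is 0.75 and recall is 0.6.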

LLM-as-Judge metrics (scored by an LLM judge based on EuroLLM-9B):

Faithfulness — whether the summary contains only information present in the original case (no hallucinations). Scored 0–1.
Completeness — how thoroughly the summary covers key clinical information (patient demographics, diagnosis, intervention, outcome, follow-up) following CARE guidelines. Scored 0–1.
Fluency — grammatical correctness and readability of the generated text, assessed in the target language. Scored 0–1.
Consistency — factual equivalence across all language versions of the same case summary (only meaningful for teams that submitted in more than one language). Scored 0–1.

All scores range from 0 to 1; higher is better.
Rankings are computed independently per metric column — a system can rank first in BERTScore and lower in Faithfulness.

Note: Additional classic metrics (BARTScore and SummaC) are planned for a future update.

Colour meaning
🥇 Gold – 1st place for that metric
🥈 Silver – 2nd place
🥉 Bronze – 3rd place
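The per-metric ranking and medal colours described above could be sketched as follows. The team names and scores are hypothetical, and so is the `medals_for` helper; tie handling is not specified in this page, so the sketch simply keeps input order for equal scores.

```python
MEDALS = {1: "🥇", 2: "🥈", 3: "🥉"}

# Hypothetical leaderboard rows for one language sub-track.
scores = {
    "team_a": {"BERTScore": 0.85, "Faithfulness": 0.70},
    "team_b": {"BERTScore": 0.80, "Faithfulness": 0.90},
    "team_c": {"BERTScore": 0.75, "Faithfulness": 0.80},
    "team_d": {"BERTScore": 0.60, "Faithfulness": 0.50},
}

def medals_for(scores, metric):
    """Rank teams on a single metric column and award the top three medals."""
    # Higher is better, so sort descending; ties keep input order (assumption).
    ranked = sorted(scores, key=lambda team: scores[team][metric], reverse=True)
    return {team: MEDALS[place] for place, team in enumerate(ranked[:3], start=1)}

for metric in ("BERTScore", "Faithfulness"):
    print(metric, medals_for(scores, metric))
```

Because each column is ranked on its own, `team_a` takes gold for BERTScore but only bronze for Faithfulness in this toy data.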