📊 Metrics Explanation
In this section, leaderboard results are reported per language sub-track. Each row corresponds to one team submission (one run). Scores are macro-averaged over all test pairs for the given language, and rows are ordered by BERTScore.
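As a rough illustration of how one leaderboard row could be derived, the sketch below averages per-pair scores with equal weight per test pair and then sorts rows by BERTScore. The team names, metric keys, and score values are hypothetical, and reading "macro-averaged" as an unweighted mean over pairs is an assumption, not the organisers' confirmed procedure.

```python
from statistics import mean

# Hypothetical per-pair scores for one language sub-track (illustrative values only).
per_pair_scores = {
    "team_a": [{"bertscore": 0.80, "rougeL": 0.40}, {"bertscore": 0.90, "rougeL": 0.50}],
    "team_b": [{"bertscore": 0.70, "rougeL": 0.60}, {"bertscore": 0.80, "rougeL": 0.70}],
}

def macro_average(pairs):
    """Average each metric over all test pairs, giving every pair equal weight."""
    metrics = pairs[0].keys()
    return {m: mean(p[m] for p in pairs) for m in metrics}

# One row per submission, ordered by BERTScore (highest first).
rows = [{"team": team, **macro_average(pairs)} for team, pairs in per_pair_scores.items()]
rows.sort(key=lambda r: r["bertscore"], reverse=True)

for row in rows:
    print(row)
```

Note that ordering by BERTScore does not imply the same order for other columns: in this toy data, team_a leads on BERTScore while team_b has the higher ROUGE-L.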
Metrics
Classic metrics:
- ROUGE-1, ROUGE-2 — unigram and bigram overlap between the generated and reference summaries (F1).
- ROUGE-Lsum — longest common subsequence overlap, computed sentence-by-sentence (F1).
- BERTScore — semantic similarity using multilingual DeBERTa embeddings (F1).
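To make the n-gram overlap idea behind ROUGE-1 and ROUGE-2 concrete, here is a minimal sketch of an F1 computation. It assumes lowercased whitespace tokenisation with no stemming, so its numbers will generally differ from those of the scorer used for the leaderboard; the function names and example sentences are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list (empty if the text is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n=1):
    """F1 of clipped n-gram overlap between a candidate and a reference summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

generated = "the patient recovered fully"
reference = "the patient recovered after surgery"
rouge1 = rouge_n_f1(generated, reference, n=1)  # 3 shared unigrams: P = 3/4, R = 3/5
rouge2 = rouge_n_f1(generated, reference, n=2)  # 2 shared bigrams: P = 2/3, R = 2/4
print(round(rouge1, 3), round(rouge2, 3))
```

ROUGE-Lsum follows the same precision/recall/F1 pattern but replaces n-gram counts with longest-common-subsequence lengths computed sentence by sentence.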
LLM-as-Judge metrics (scored by an LLM judge based on EuroLLM-9B):
- Faithfulness — whether the summary contains only information present in the original case (no hallucinations). Scored 0–1.
- Completeness — how thoroughly the summary covers key clinical information (patient demographics, diagnosis, intervention, outcome, follow-up) following CARE guidelines. Scored 0–1.
- Fluency — grammatical correctness and readability of the generated text, assessed in the target language. Scored 0–1.
- Consistency — factual equivalence across all language versions of the same case summary (only meaningful for teams that submitted multiple languages). Scored 0–1.
All scores range from 0 to 1; higher is better.
Rankings are computed independently per metric column — a system can rank first in BERTScore and lower in Faithfulness.
Note: Additional classic metrics (BARTScore and SummaC) are planned for a later update.
Colour meaning
🥇 Gold – 1st place for that metric
🥈 Silver – 2nd place
🥉 Bronze – 3rd place
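Because rankings are computed independently per metric column, medals can be assigned by sorting the rows once per metric. The sketch below shows one way to do this; the row data are hypothetical, and the tie handling (ties keep their input order) is an assumption rather than the leaderboard's documented behaviour.

```python
MEDALS = ["🥇", "🥈", "🥉"]

def assign_medals(rows, metrics):
    """Return {team: {metric: medal}} for the top three rows of each metric."""
    medals = {row["team"]: {} for row in rows}
    for metric in metrics:
        # Higher is better for every metric, so sort in descending order.
        ranked = sorted(rows, key=lambda r: r[metric], reverse=True)
        # zip stops after three entries, so only the top three get a medal.
        for medal, row in zip(MEDALS, ranked):
            medals[row["team"]][metric] = medal
    return medals

# Hypothetical leaderboard rows (illustrative values only).
rows = [
    {"team": "team_a", "bertscore": 0.85, "faithfulness": 0.70},
    {"team": "team_b", "bertscore": 0.80, "faithfulness": 0.90},
    {"team": "team_c", "bertscore": 0.75, "faithfulness": 0.80},
    {"team": "team_d", "bertscore": 0.70, "faithfulness": 0.60},
]

medals = assign_medals(rows, ["bertscore", "faithfulness"])
for team, won in medals.items():
    print(team, won)
```

In this toy data, team_a takes gold for BERTScore but only bronze for Faithfulness, which is the situation the ranking note describes.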