{"id":273,"date":"2026-05-18T08:11:50","date_gmt":"2026-05-18T08:11:50","guid":{"rendered":"https:\/\/temu.bsc.es\/multiclinsum2\/?page_id=273"},"modified":"2026-05-19T15:35:57","modified_gmt":"2026-05-19T15:35:57","slug":"leaderboard","status":"publish","type":"page","link":"https:\/\/temu.bsc.es\/multiclinsum2\/task-info\/leaderboard\/","title":{"rendered":"Results Metrics"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"><strong>\ud83d\udcca Metrics Explanation<\/strong><\/h2>\n\n\n\n<p>In this section, leaderboard results are reported per language sub-track. Each row corresponds to a team submission (one run). Scores are macro-averaged over all test pairs for the given language, and rows are ordered by BERTScore.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Metrics<\/h4>\n\n\n\n<p><strong>Classic metrics<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><span style=\"text-decoration: underline;\"><em>ROUGE-1, ROUGE-2<\/em><\/span> \u2014 unigram\/bigram overlap between the generated and reference summaries (F1).<\/li>\n\n\n\n<li><span style=\"text-decoration: underline;\"><em>ROUGE-Lsum<\/em><\/span> \u2014 longest common subsequence overlap, computed sentence-by-sentence (F1).<\/li>\n\n\n\n<li><a href=\"https:\/\/arxiv.org\/abs\/1904.09675\" data-type=\"link\" data-id=\"https:\/\/arxiv.org\/abs\/1904.09675\"><em>BERTScore<\/em><\/a> \u2014 semantic similarity using multilingual DeBERTa embeddings (F1).<\/li>\n<\/ul>\n\n\n\n<p><strong>LLM-as-Judge metrics<\/strong> (evaluated by an LLM judge powered by <a href=\"https:\/\/huggingface.co\/blog\/eurollm-team\/eurollm-9b\" data-type=\"link\" data-id=\"https:\/\/huggingface.co\/blog\/eurollm-team\/eurollm-9b\">EuroLLM-9B<\/a>):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em><span style=\"text-decoration: underline;\">Faithfulness<\/span><\/em> \u2014 whether the summary contains only information present in the original case (no hallucinations). 
Scored 0\u20131.<\/li>\n\n\n\n<li><em><span style=\"text-decoration: underline;\">Completeness<\/span><\/em> \u2014 how thoroughly the summary covers key clinical information (patient demographics, diagnosis, intervention, outcome, follow-up) following the&nbsp;CARE guidelines. Scored 0\u20131.<\/li>\n\n\n\n<li><em><span style=\"text-decoration: underline;\">Fluency<\/span><\/em> \u2014 grammatical correctness and readability of the generated text, assessed in the target language. Scored 0\u20131.<\/li>\n\n\n\n<li><em><span style=\"text-decoration: underline;\">Consistency<\/span><\/em> \u2014 factual equivalence across all language versions of the same&nbsp;case summary (only meaningful for teams that submitted multiple languages). Scored 0\u20131.<\/li>\n<\/ul>\n\n\n\n<p>All scores range from 0 to 1; higher is better. <br>Rankings are computed independently per metric column \u2014 a system can rank first in BERTScore and lower in Faithfulness.<\/p>\n\n\n\n\n\n<p><strong>Note<\/strong>: Additional classic metrics (<a href=\"https:\/\/github.com\/neulab\/BARTScore\" data-type=\"link\" data-id=\"https:\/\/github.com\/neulab\/BARTScore\">BARTScore<\/a> and <a href=\"https:\/\/github.com\/tingofurro\/summac\" data-type=\"link\" data-id=\"https:\/\/github.com\/tingofurro\/summac\">SummaC<\/a>) will be added in a future update.<\/p>\n\n\n\n<p><em><span style=\"text-decoration: underline;\">Colour meaning<\/span><\/em><br>\ud83e\udd47 Gold &#8211; 1st place for that metric <br>\ud83e\udd48 Silver &#8211; 2nd place <br>\ud83e\udd49 Bronze &#8211; 3rd place<\/p>\n\n\n","protected":false},"excerpt":{"rendered":"<p>\ud83d\udcca Metrics Explanation In this section, leaderboard results are reported per language sub-track. Each row corresponds to a team submission (one run). Scores are macro-averaged over all test pairs for the given language, and rows are ordered by BERTScore. 
Metrics Classic metrics: LLM-as-Judge metrics (evaluated by an LLM judge powered by EuroLLM-9B): All scores range from [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":10,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-273","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/pages\/273","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/comments?post=273"}],"version-history":[{"count":30,"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/pages\/273\/revisions"}],"predecessor-version":[{"id":339,"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/pages\/273\/revisions\/339"}],"up":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/pages\/10"}],"wp:attachment":[{"href":"https:\/\/temu.bsc.es\/multiclinsum2\/wp-json\/wp\/v2\/media?parent=273"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}