{"id":17,"date":"2025-11-07T11:07:23","date_gmt":"2025-11-07T10:07:23","guid":{"rendered":"https:\/\/temu.bsc.es\/MultiClinAI\/?page_id=17"},"modified":"2026-02-26T16:33:49","modified_gmt":"2026-02-26T15:33:49","slug":"evaluation-submission","status":"publish","type":"page","link":"https:\/\/temu.bsc.es\/MultiClinAI\/evaluation-submission\/","title":{"rendered":"Evaluation &#038; Submission"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\"><strong>Evaluation and Submission<\/strong><\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">Evaluation<\/h2>\n\n\n\n<p>Evaluation of automatic predictions for this task will be conducted under two different scenarios or sub-tracks. In all cases, the primary evaluation metrics will be <strong>micro-averaged precision, recall, and F1-scores<\/strong>. The evaluation scripts, along with their documentation, will be freely available on GitHub so that participating teams can test the evaluation tools locally.<\/p>\n\n\n\n<p>As for baseline systems, the <strong>MultiClinNER<\/strong> baseline will be based on vocabulary transfer from the training set entities, combined with a gazetteer lookup over the test set corpus. 
For the <strong>MultiClinCorpus<\/strong> task, a simple lexical lookup of translated entity mentions will serve as the baseline.<\/p>\n\n\n\n<p>More information coming soon!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Metrics Definition<\/h2>\n\n\n\n<p>The following metrics are reported:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Micro-F1<\/strong>: Micro-averaged F1-score computed over all entity mentions.<\/li>\n\n\n\n<li><strong>Micro-Precision<\/strong>: Micro-averaged precision over all entity mentions.<\/li>\n\n\n\n<li><strong>Micro-Recall<\/strong>: Micro-averaged recall over all entity mentions.<\/li>\n\n\n\n<li><strong>Macro-F1<\/strong>: Macro-averaged F1-score computed at the document (note) level.<\/li>\n\n\n\n<li><strong>Macro-Precision<\/strong>: Macro-averaged precision at the document level.<\/li>\n\n\n\n<li><strong>Macro-Recall<\/strong>: Macro-averaged recall at the document level.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Ranking Criteria<\/h3>\n\n\n\n<p>The official ranking of submissions will be determined according to the following priority:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Micro F1-score<\/strong><\/li>\n\n\n\n<li><strong>Macro F1-score<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Additional metrics are reported for completeness and analysis purposes.<\/p>\n\n\n\n<p>For full implementation details, participants are encouraged to consult the <strong>official evaluation scripts<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Submission<\/h2>\n\n\n\n<p>The official submission process for MultiClinAI will be conducted through the <strong>CodaBench evaluation platform<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1 \u2013 Prepare your submission files<\/h3>\n\n\n\n<p>Teams must prepare a <strong>single ZIP file<\/strong> containing all the <code>run.tsv<\/code> files corresponding to 
the submitted runs.<\/p>\n\n\n\n<p>Each <code>run.tsv<\/code> file must be a <strong>tab-separated (.tsv) file<\/strong> and must include a header with <strong>exactly the following columns<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>filename<\/code><\/li>\n\n\n\n<li><code>label<\/code><\/li>\n\n\n\n<li><code>start_span<\/code><\/li>\n\n\n\n<li><code>end_span<\/code><\/li>\n\n\n\n<li><code>text<\/code><\/li>\n<\/ul>\n\n\n\n<p>All columns are <strong>mandatory<\/strong>. The <code>start_span<\/code> and <code>end_span<\/code> values must correspond to <strong>character offsets<\/strong> in the original document text.<\/p>\n\n\n\n<p>Files that do not strictly follow this format may be rejected by the evaluation system.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2 \u2013 Upload to evaluation platform<\/h3>\n\n\n\n<p>The ZIP file must be uploaded to the official MultiClinAI task page on the <strong>CodaBench evaluation platform<\/strong> during the evaluation phase.<\/p>\n\n\n\n<p>The evaluation platform will automatically validate and score the submission. 
Only submissions successfully uploaded to the evaluation platform will be considered official and included in the ranking.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3 \u2013 Send backup email (mandatory)<\/h3>\n\n\n\n<p>Immediately after uploading to the evaluation platform, teams must send an email to the task organizers including:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>exact same ZIP file<\/strong> uploaded to CodaBench (as a backup copy).<\/li>\n\n\n\n<li>A <strong>mandatory README file<\/strong> describing in detail the methodology used for each submitted run.<\/li>\n<\/ul>\n\n\n\n<p>The README must clearly explain the approach followed for each run (e.g., model architecture, training procedure, external resources, prompting strategy, projection method, hyperparameters, and any additional data used).<\/p>\n\n\n\n<p>Submissions without a README will be considered <strong>incomplete<\/strong> and may not be included in the final evaluation or official results.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Detailed deadlines, submission limits, and the official evaluation platform link will be announced before the evaluation phase opens.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Evaluation and Submission Evaluation Evaluation of automatic predictions for this task will be conducted under two different scenarios or sub-tracks. In all cases, the primary evaluation metrics will be micro-averaged precision, recall, and F1-scores. 
The evaluation scripts, along with proper documentation, will be freely available on GitHub to allow participating teams to test the evaluation [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-17","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/17","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/comments?post=17"}],"version-history":[{"count":9,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/17\/revisions"}],"predecessor-version":[{"id":221,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/17\/revisions\/221"}],"wp:attachment":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/media?parent=17"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}