Evaluation and Submission
Evaluation
Evaluation of automatic predictions for this task will be conducted under two different scenarios or sub-tracks. In all cases, the primary evaluation metrics will be micro-averaged precision, recall, and F1-score. The evaluation scripts, together with their documentation, will be freely available on GitHub so that participating teams can run the evaluation locally.
Regarding the baseline systems: for the MultiClinNER subtask, the baseline predictions will be obtained by vocabulary transfer from the training set entities, combined with a gazetteer lookup over the test set corpus. For the MultiClinCorpus task, a simple lexical lookup of translated entity mentions will serve as the baseline.
More information coming soon!
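In the meantime, purely as an informal illustration of the gazetteer-lookup idea (not the official baseline code), the sketch below tags exact occurrences of known training-set mentions in a document and reports them with character offsets. The single generic label and the function and variable names (gazetteer_lookup, gazetteer) are simplifications introduced here for illustration; a real baseline would carry over the label observed in the training annotations.

```python
import re

def gazetteer_lookup(text, gazetteer, label="ENTITY"):
    """Naive lookup baseline: tag every exact occurrence of a known
    training mention, longest mentions first, skipping overlapping hits."""
    predictions, claimed = [], set()
    for mention in sorted(gazetteer, key=len, reverse=True):
        for match in re.finditer(re.escape(mention), text):
            span = range(match.start(), match.end())
            if claimed.isdisjoint(span):  # keep only non-overlapping matches
                claimed.update(span)
                predictions.append({
                    "label": label,
                    "start_span": match.start(),
                    "end_span": match.end(),
                    "text": match.group(0),
                })
    return sorted(predictions, key=lambda p: p["start_span"])

# Toy usage with a gazetteer built from hypothetical training annotations.
gazetteer = {"diabetes mellitus", "hipertensión arterial"}
print(gazetteer_lookup("Paciente con diabetes mellitus tipo 2.", gazetteer))
```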
Metrics Definition
The following metrics are reported:
- Micro-F1: Micro-averaged F1-score computed over all entity mention predictions, pooling counts across the entire test set.
- Micro-Precision: Micro-averaged precision over all entity mention predictions.
- Micro-Recall: Micro-averaged recall over all entity mention predictions.
- Macro-F1: Macro-averaged F1-score computed at the document (note) level.
- Macro-Precision: Macro-averaged precision at the document level.
- Macro-Recall: Macro-averaged recall at the document level.
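For intuition only (the official evaluation scripts on GitHub remain the reference implementation), the sketch below shows one common way to compute these scores, assuming gold and predicted annotations are represented as sets of (label, start_span, end_span) tuples per document; the names evaluate and gold_by_doc are illustrative and not part of the official tooling.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from pooled counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold_by_doc, pred_by_doc):
    """gold_by_doc / pred_by_doc map a filename to the set of
    (label, start_span, end_span) tuples for that document.
    Assumes at least one gold document."""
    tp = fp = fn = 0   # pooled counts for micro-averaging
    per_doc = []       # per-document scores for macro-averaging
    for doc, gold in gold_by_doc.items():
        pred = pred_by_doc.get(doc, set())
        d_tp, d_fp, d_fn = len(gold & pred), len(pred - gold), len(gold - pred)
        tp, fp, fn = tp + d_tp, fp + d_fp, fn + d_fn
        per_doc.append(prf(d_tp, d_fp, d_fn))
    micro_p, micro_r, micro_f1 = prf(tp, fp, fn)
    macro_p = sum(p for p, _, _ in per_doc) / len(per_doc)
    macro_r = sum(r for _, r, _ in per_doc) / len(per_doc)
    macro_f1 = sum(f for _, _, f in per_doc) / len(per_doc)
    return {"micro_p": micro_p, "micro_r": micro_r, "micro_f1": micro_f1,
            "macro_p": macro_p, "macro_r": macro_r, "macro_f1": macro_f1}
```

The practical difference is that micro scores pool counts over all documents, so longer notes with more entities weigh more, while macro scores average per-document values, giving each note equal weight.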
Ranking Criteria
The official ranking of submissions will be determined according to the following priority:
- Micro F1-score
- Macro F1-score
Additional metrics are reported for completeness and analysis purposes.
For full implementation details, participants are encouraged to consult the official evaluation scripts.
Submission
The official submission process for MultiClinAI will be conducted through the CodaBench evaluation platform.
Step 1 – Prepare your submission files
Teams must prepare a single ZIP file containing all the run.tsv files corresponding to the submitted runs.
Each run.tsv file must be a tab-separated (.tsv) file and must include a header with exactly the following columns:
filename    label    start_span    end_span    text
All columns are mandatory. The start_span and end_span values must correspond to character offsets in the original document text.
Files that do not strictly follow this format may be rejected by the evaluation system.
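As a rough sketch only (the authoritative format check is the one performed by the CodaBench scorer), the snippet below writes a run.tsv with the required tab-separated header and packages it into the submission ZIP; the file names run1.tsv and submission.zip are placeholders chosen for illustration.

```python
import csv
import zipfile

COLUMNS = ["filename", "label", "start_span", "end_span", "text"]

def write_run(path, predictions):
    """Write one run file with the mandatory tab-separated header.
    predictions: iterable of dicts keyed exactly by COLUMNS."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(predictions)

# Hypothetical example: one predicted mention, packaged into submission.zip.
write_run("run1.tsv", [{
    "filename": "note_001.txt", "label": "DISEASE",
    "start_span": 13, "end_span": 30, "text": "diabetes mellitus",
}])
with zipfile.ZipFile("submission.zip", "w") as archive:
    archive.write("run1.tsv")
```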
Step 2 – Upload to CodaBench
The ZIP file must be uploaded to the official MultiClinAI task page on CodaBench during the evaluation phase.
CodaBench will automatically validate and score the submission. Only submissions successfully uploaded to CodaBench will be considered official and included in the ranking.
Step 3 – Send backup email (mandatory)
Immediately after uploading to CodaBench, teams must send an email to the task organizers including:
- The exact same ZIP file uploaded to CodaBench (as a backup copy).
- A mandatory README file describing in detail the methodology used for each submitted run.
The README must clearly explain the approach followed for each run (e.g., model architecture, training procedure, external resources, prompting strategy, projection method, hyperparameters, and any additional data used).
Submissions without a README will be considered incomplete and may not be included in the final evaluation or official results.
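Purely as an illustration (the organizers do not prescribe a fixed template), a README covering the points above might be organized along these lines:

```text
Team: <team name>
Contact: <email>

run1.tsv
  Model / architecture: <e.g. fine-tuned multilingual encoder>
  Training procedure:   <data used, hyperparameters>
  External resources:   <gazetteers, translation or projection tools, extra data>
  Prompting / projection strategy: <if applicable>

run2.tsv
  ...
```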
Detailed deadlines, submission limits, and the official CodaBench link will be announced before the evaluation phase opens.