Evaluation and Submission
Evaluation
Evaluation of automatic predictions for this task will be conducted under two different scenarios or sub-tracks. In all cases, the primary evaluation metrics will be micro-averaged precision, recall, and F1-score. The evaluation scripts, together with their documentation, will be freely available on GitHub so that participating teams can run the evaluation locally.
Regarding the baseline systems: for the MultiClinNER subtask, the baseline predictions will be obtained by vocabulary transfer from the training set entities, combined with a gazetteer lookup over the test set corpus. For the MultiClinCorpus task, a simple lexical lookup of translated entity mentions will serve as the baseline.
More information coming soon!
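In the meantime, purely as an informal illustration of the gazetteer-lookup idea (not the official baseline code), the sketch below tags exact occurrences of known training-set mentions in a document and reports them with character offsets. The single generic label and the function and variable names (gazetteer_lookup, gazetteer) are simplifications introduced here for illustration; a real baseline would carry over the label observed in the training annotations.

```python
import re

def gazetteer_lookup(text, gazetteer, label="ENTITY"):
    """Naive lookup baseline: tag every exact occurrence of a known
    training mention, longest mentions first, skipping overlapping hits."""
    predictions, claimed = [], set()
    for mention in sorted(gazetteer, key=len, reverse=True):
        for match in re.finditer(re.escape(mention), text):
            span = range(match.start(), match.end())
            if claimed.isdisjoint(span):  # keep only non-overlapping matches
                claimed.update(span)
                predictions.append({
                    "label": label,
                    "start_span": match.start(),
                    "end_span": match.end(),
                    "text": match.group(0),
                })
    return sorted(predictions, key=lambda p: p["start_span"])

# Toy usage with a gazetteer built from hypothetical training annotations.
gazetteer = {"diabetes mellitus", "hipertensión arterial"}
print(gazetteer_lookup("Paciente con diabetes mellitus tipo 2.", gazetteer))
```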
Metrics Definition
The following metrics are reported:
- Micro-F1: Micro-averaged F1-score computed over all entity mention predictions, pooling counts across the entire test set.
- Micro-Precision: Micro-averaged precision over all entity mention predictions.
- Micro-Recall: Micro-averaged recall over all entity mention predictions.
- Macro-F1: Macro-averaged F1-score computed at the document (note) level.
- Macro-Precision: Macro-averaged precision at the document level.
- Macro-Recall: Macro-averaged recall at the document level.
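For intuition only (the official evaluation scripts on GitHub remain the reference implementation), the sketch below shows one common way to compute these scores, assuming gold and predicted annotations are represented as sets of (label, start_span, end_span) tuples per document; the names evaluate and gold_by_doc are illustrative and not part of the official tooling.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from pooled counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold_by_doc, pred_by_doc):
    """gold_by_doc / pred_by_doc map a filename to the set of
    (label, start_span, end_span) tuples for that document.
    Assumes at least one gold document."""
    tp = fp = fn = 0   # pooled counts for micro-averaging
    per_doc = []       # per-document scores for macro-averaging
    for doc, gold in gold_by_doc.items():
        pred = pred_by_doc.get(doc, set())
        d_tp, d_fp, d_fn = len(gold & pred), len(pred - gold), len(gold - pred)
        tp, fp, fn = tp + d_tp, fp + d_fp, fn + d_fn
        per_doc.append(prf(d_tp, d_fp, d_fn))
    micro_p, micro_r, micro_f1 = prf(tp, fp, fn)
    macro_p = sum(p for p, _, _ in per_doc) / len(per_doc)
    macro_r = sum(r for _, r, _ in per_doc) / len(per_doc)
    macro_f1 = sum(f for _, _, f in per_doc) / len(per_doc)
    return {"micro_p": micro_p, "micro_r": micro_r, "micro_f1": micro_f1,
            "macro_p": macro_p, "macro_r": macro_r, "macro_f1": macro_f1}
```

The practical difference is that micro scores pool counts over all documents, so longer notes with more entities weigh more, while macro scores average per-document values, giving each note equal weight.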
Ranking Criteria
The official ranking of submissions will be determined according to the following priority:
- Micro F1-score
- Macro F1-score
Additional metrics are reported for completeness and analysis purposes.
For full implementation details, participants are encouraged to consult the official evaluation scripts.
Submission
The official submission process for MultiClinAI will be conducted through the CodaBench evaluation platform.
Step 1 – Prepare your submission files
Teams must prepare a single ZIP file containing all the run.tsv files corresponding to the submitted runs.
Each run.tsv file must be a tab-separated (.tsv) file and must include a header with exactly the following columns:
filename    label    start_span    end_span    text
All columns are mandatory. The start_span and end_span values must correspond to character offsets in the original document text.
Files that do not strictly follow this format may be rejected by the evaluation system.
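As a rough sketch only (the authoritative format check is the one performed by the CodaBench scorer), the snippet below writes a run.tsv with the required tab-separated header and packages it into the submission ZIP; the file names run1.tsv and submission.zip are placeholders chosen for illustration.

```python
import csv
import zipfile

COLUMNS = ["filename", "label", "start_span", "end_span", "text"]

def write_run(path, predictions):
    """Write one run file with the mandatory tab-separated header.
    predictions: iterable of dicts keyed exactly by COLUMNS."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(predictions)

# Hypothetical example: one predicted mention, packaged into submission.zip.
write_run("run1.tsv", [{
    "filename": "note_001.txt", "label": "DISEASE",
    "start_span": 13, "end_span": 30, "text": "diabetes mellitus",
}])
with zipfile.ZipFile("submission.zip", "w") as archive:
    archive.write("run1.tsv")
```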
Step 2 – Upload to CodaBench
The ZIP file must be uploaded to the official MultiClinAI task page on CodaBench during the evaluation phase.
CodaBench will automatically validate and score the submission. Only submissions successfully uploaded to CodaBench will be considered official and included in the ranking.
Step 3 – Send backup email (mandatory)
Immediately after uploading to CodaBench, teams must send an email to the task organizers including:
- The exact same ZIP file uploaded to CodaBench (as a backup copy).
- A mandatory README file describing in detail the methodology used for each submitted run.
The README must clearly explain the approach followed for each run (e.g., model architecture, training procedure, external resources, prompting strategy, projection method, hyperparameters, and any additional data used).
Submissions without a README will be considered incomplete and may not be included in the final evaluation or official results.
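Purely as an illustration (the organizers do not prescribe a fixed template), a README covering the points above might be organized along these lines:

```text
Team: <team name>
Contact: <email>

run1.tsv
  Model / architecture: <e.g. fine-tuned multilingual encoder>
  Training procedure:   <data used, hyperparameters>
  External resources:   <gazetteers, translation or projection tools, extra data>
  Prompting / projection strategy: <if applicable>

run2.tsv
  ...
```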
Detailed deadlines, submission limits, and the official CodaBench link will be announced before the evaluation phase opens.