Evaluation Library

MEDDOPROF will be evaluated using the official evaluation library, which can be downloaded from GitHub. The library is written in Python 3 and is intended to be run from the command line:

$ python main.py -g ../gs-data/ner/ -p ../toy-data/ner/ -s ner
$ python main.py -g ../gs-data/class/ -p ../toy-data/class -s class
$ python main.py -g ../gs-data/gs-norm.tsv -p ../toy-data/pred-norm.tsv -c ../meddoprof_valid_codes.tsv -s norm

For all subtasks, the relevant metrics are precision, recall and F1-score. F1-score will be used to decide the award winners.

Evaluation Method

MEDDOPROF Shared Task’s sub-tracks will be evaluated in the following way:

Track A – MEDDOPROF-NER

Submissions will be ranked by Precision, Recall and F1-score computed over extracted PROFESION [profession] and SITUACION_LABORAL [working status] mentions, where a mention counts only if its span matches the Gold Standard exactly (F1-score is the primary metric).

A correct prediction must have the same beginning and ending offsets as the Gold Standard annotation, as well as the same label (PROFESION or SITUACION_LABORAL).

Prediction format: brat annotation files (.ann) containing your predictions.
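
To make this criterion concrete, the following Python sketch compares a gold .ann file against a predicted one using exact label-and-offset matching. This is not the official scorer: the file paths are invented, discontinuous spans are not handled, and the downloadable library remains the reference implementation.

def load_ann(path):
    """Return the set of (label, start, end) entities in a brat .ann file."""
    entities = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.startswith("T"):
                continue  # keep only text-bound annotations
            # Standard brat standoff line: T<id> TAB LABEL START END TAB text
            _, type_span, _ = line.rstrip("\n").split("\t", 2)
            label, start, end = type_span.split()  # simple continuous spans only
            entities.add((label, int(start), int(end)))
    return entities

gold = load_ann("gs-data/ner/doc1.ann")   # hypothetical file names
pred = load_ann("preds/ner/doc1.ann")

tp = len(gold & pred)   # exact label + offsets match
fp = len(pred - gold)   # predicted but not in the Gold Standard
fn = len(gold - pred)   # in the Gold Standard but missed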

Track B – MEDDOPROF-CLASS

Submissions will be ranked by Precision, Recall and F1-score computed over extracted PACIENTE [patient], FAMILIAR [family member], SANITARIO [health professional] and OTROS [others] mentions, where a mention counts only if its span matches the Gold Standard exactly (F1-score is the primary metric).

A correct prediction must have the same beginning and ending offsets as the Gold Standard annotation, as well as the same label.

Prediction format: brat annotation files (.ann) containing your predictions.

Track C – MEDDOPROF-NORM

For this track, participants will be provided with a list of unique concept identifiers from the European Skills, Competences, Qualifications and Occupations (ESCO) classification and relevant SNOMED-CT terms. Participants will have to detect PROFESION and SITUACION_LABORAL mentions and map each of them to one of the terms in the list. Their mappings will then be compared to the manually annotated concept identifiers and evaluated using F1-score.
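
As an illustration of how such a comparison could work, the sketch below reduces gold and predicted mentions to (document, start, end, code) tuples and intersects the two sets. The four-column TSV layout, file paths and helper name are assumptions made for this sketch only; the actual format is defined by the gs-norm.tsv and pred-norm.tsv files distributed with the task.

import csv

def load_norm(path):
    """Return the set of (document, start, end, code) tuples in a TSV file.

    Assumes four tab-separated columns and no header row; adjust to the
    official file layout as needed.
    """
    mentions = set()
    with open(path, encoding="utf-8", newline="") as f:
        for doc, start, end, code in csv.reader(f, delimiter="\t"):
            mentions.add((doc, int(start), int(end), code))
    return mentions

gold = load_norm("gs-data/gs-norm.tsv")
pred = load_norm("preds/pred-norm.tsv")   # hypothetical path
tp, fp, fn = len(gold & pred), len(pred - gold), len(gold - pred)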

Precision, Recall and F1-score will be calculated using the following formulas:

Precision (P) = true positives / (true positives + false positives)

Recall (R) = true positives / (true positives + false negatives)

F1-score (F1) = 2 * (P * R) / (P + R)
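
These formulas translate directly into code. The short sketch below is only a sanity check with made-up counts, not the official scorer:

def prf1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * (p * r) / (p + r) if p + r else 0.0
    return p, r, f1

# 80 correct mentions, 20 spurious predictions, 40 missed gold mentions:
p, r, f1 = prf1(80, 20, 40)
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")   # P=0.800  R=0.667  F1=0.727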