Examples

The CodiEsp evaluation script can be downloaded from GitHub.

Please make sure you have the latest version.


Example 1: CodiEsp-D or CodiEsp-P
Evaluate the system output pred_D.tsv against the gold standard gs_D.tsv (both inside toy_data subfolders).

$> python3 codiespD_P_evaluation.py -g gold/toy_data/gs_D.tsv -p system/toy_data/pred_D.tsv -c codiesp_codes/codiesp-D_codes.tsv

MAP estimate: 0.444
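
For reference, MAP here is the mean over clinical cases of the average precision of each ranked code list. The script itself relies on the trectools library; the following is only a minimal sketch of the generic textbook computation, not the script's exact code:

def average_precision(ranked_codes, gold_codes):
    # Accumulate precision at each rank where a gold code is found.
    hits, total = 0, 0.0
    for rank, code in enumerate(ranked_codes, start=1):
        if code in gold_codes:
            hits += 1
            total += hits / rank
    return total / len(gold_codes) if gold_codes else 0.0

def mean_average_precision(preds_by_doc, gold_by_doc):
    # MAP: average of the per-document average precision values.
    return sum(average_precision(preds_by_doc.get(doc, []), gold)
               for doc, gold in gold_by_doc.items()) / len(gold_by_doc)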

Example 2: CodiEsp-X
Evaluate the system output pred_X.tsv against the gold standard gs_X.tsv (both inside toy_data subfolders).

$> python3 codiespX_evaluation.py -g gold/toy_data/gs_X.tsv -p system/toy_data/pred_X.tsv -cD codiesp_codes/codiesp-D_codes.tsv -cP codiesp_codes/codiesp-P_codes.tsv 

-----------------------------------------------------
Clinical case name			Precision
-----------------------------------------------------
S0000-000S0000000000000-00		nan
-----------------------------------------------------
S1889-836X2016000100006-1		0.625
-----------------------------------------------------
codiespX_evaluation.py:248: UserWarning: Some documents do not have predicted codes, document-wise Precision not computed for them.

Micro-average precision = 0.556


-----------------------------------------------------
Clinical case name			Recall
-----------------------------------------------------
S0000-000S0000000000000-00		nan
-----------------------------------------------------
S1889-836X2016000100006-1		0.455
-----------------------------------------------------
codiespX_evaluation.py:260: UserWarning: Some documents do not have Gold Standard codes, document-wise Recall not computed for them.

Micro-average recall = 0.385


-----------------------------------------------------
Clinical case name			F-score
-----------------------------------------------------
S0000-000S0000000000000-00		nan
-----------------------------------------------------
S1889-836X2016000100006-1		0.526
-----------------------------------------------------
codiespX_evaluation.py:271: UserWarning: Some documents do not have predicted codes, document-wise F-score not computed for them.
codiespX_evaluation.py:274: UserWarning: Some documents do not have Gold Standard codes, document-wise F-score not computed for them.

Micro-average F-score = 0.455


__________________________________________________________

MICRO-AVERAGE STATISTICS:

Micro-average precision = 0.556

Micro-average recall = 0.385

Micro-average F-score = 0.455
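
Note that the micro-average F-score is the harmonic mean of the micro-average precision and recall: 2 × 0.556 × 0.385 / (0.556 + 0.385) ≈ 0.455, matching the value reported above.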

Contact for technical issues

Antonio Miranda-Escalada (antonio.miranda@bsc.es)

Evaluation Library

The CodiEsp evaluation script can be downloaded from GitHub.

Please make sure you have the latest version.


Introduction

These scripts are distributed as part of the Clinical Cases Coding in Spanish language Track (CodiEsp). They are intended to be run via command line:

$> python3 codiespD_P_evaluation.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -c /path/to/codes.tsv
$> python3 codiespX_evaluation.py -g /path/to/gold_standard.tsv -p /path/to/predictions.tsv -cD /path/to/codes-D.tsv -cP /path/to/codes-P.tsv

They produce the evaluation metrics for the corresponding sub-tracks: Mean Average Precision for the CodiEsp-D and CodiEsp-P sub-tracks, and the custom evaluation score for the CodiEsp-X sub-track.

gold_standard.tsv must be one of the gold standard files distributed on the CodiEsp Track webpage, or a file with the same format.

predictions.tsv must be the predictions file. For CodiEsp-D and CodiEsp-P, it is a tab-separated file with two columns: clinical case and code. Codes must be listed in rank order. For example:

CodiEsp-D and CodiEsp-P predictions example screenshot.
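
For instance, a minimal sketch of writing such a file with pandas (the clinical case identifier is taken from the toy data above; the codes themselves are hypothetical):

import pandas as pd

# Two tab-separated columns, no header: clinical case and code.
# Rows for the same clinical case must appear in rank order.
preds = pd.DataFrame(
    [("S1889-836X2016000100006-1", "n17.9"),  # hypothetical code, rank 1
     ("S1889-836X2016000100006-1", "i10")],   # hypothetical code, rank 2
    columns=["clinical_case", "code"],
)
preds.to_csv("pred_D.tsv", sep="\t", header=False, index=False)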

For CodiEsp-X, the predictions.tsv file is also tab-separated, in this case with four columns: clinical case, reference position, code label, and code. For example:

CodiEsp-X Predictions example screenshot.
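
Analogously, a minimal sketch for CodiEsp-X (all values hypothetical; the reference position is assumed here to be the character offsets of the supporting text span):

# Four tab-separated columns, no header: clinical case, reference
# position, code label (DIAGNOSTICO or PROCEDIMIENTO), and code.
pred_x = pd.DataFrame(
    [("S1889-836X2016000100006-1", "124 136", "DIAGNOSTICO", "n17.9")],
    columns=["clinical_case", "position", "label", "code"],
)
pred_x.to_csv("pred_X.tsv", sep="\t", header=False, index=False)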

codes.tsv must be the file with the valid codes, downloaded from Zenodo.
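
As a sanity check before submitting, predictions can be filtered against this list. Continuing the sketch above, and assuming the codes appear in the first column of the codes file:

# Keep only predictions whose code appears in the official list;
# compare in lowercase to be safe about casing.
valid = pd.read_csv("codiesp_codes/codiesp-D_codes.tsv", sep="\t",
                    header=None, usecols=[0])[0]
preds = preds[preds["code"].str.lower().isin(set(valid.str.lower()))]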

Prerequisites

This software requires Python 3 with the Pandas, NumPy, SciPy, Matplotlib, and trectools libraries installed.
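
If any of these libraries are missing, they can typically be installed with pip:

$> pip install pandas numpy scipy matplotlib trectools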

Directory structure

Replicating the directory structure of the evaluation library's GitHub repository is not required to run the Python scripts.

Usage

Both scripts accept the same two parameters:

  • The --gs_path (-g) option specifies the path to the Gold Standard file.
  • The --pred_path (-p) option specifies the path to the predictions file.

In addition, codiespD_P_evaluation.py requires an extra parameter:

  • The --valid_codes_path (-c) option specifies the path to the list of valid codes for the CodiEsp subtask we are evaluating.

Finally, codiespX_evaluation.py requires two extra parameters:

  • The --valid_codes_D_path (-cD) option specifies the path to the list of valid codes for the CodiEsp-D subtask.
  • The --valid_codes_P_path (-cP) option specifies the path to the list of valid codes for the CodiEsp-P subtask.

Contact for technical issues

Antonio Miranda-Escalada (antonio.miranda@bsc.es)

License

Awards

The Plan for the Advancement of Language Technology (Plan TL) aims to promote the development of natural language processing, machine translation, and conversational systems in Spanish and co-official languages. Within the framework of this plan, we announce the shared task awards detailed below.

The first-place team in each sub-task will receive a prize of 1,000 euros, the second-place team 500 euros, and the third-place team 200 euros.

Registration: Fill in an online registration form. See Registration for further details.

Deadlines for submission: The deadline for submission is May 3, 2020, and the awards will be resolved after the CLEF conference on September 22-25, 2020. For further details, please refer to the Schedule.

Evaluation: The evaluation of the automatic predictions for this task will have three different scenarios or sub-tasks:

  1. CodiEsp-D. Main evaluation metric: Mean Average Precision.
  2. CodiEsp-P. Main evaluation metric: Mean Average Precision.
  3. CodiEsp-X. Main evaluation metric: F-score.

For further details on the evaluation of the sub-tasks, please refer to Evaluation.

Selection of winners: The top three teams in each task will be selected as finalists to receive the prizes. System evaluations will be performed according to the evaluation criteria described in Evaluation.

Contact: For further details, please contact Martin Krallinger at encargo-pln-life@bsc.es.

Official CodiEsp results

** IAM Explainability results are unofficial (see the overview paper for more information).

Annotation Guidelines

Datasets can be downloaded from Zenodo.

A complete list of relevant Diagnóstico and Procedimiento codes for the task is available at Zenodo.

Annotation guidelines may be found here.


The annotation process of the CodiEsp corpus was carried out in collaboration with terminology experts. Annotators followed an iterative training process until a high inter-annotator agreement (IAA) was reached. The final inter-annotator agreement measure for this corpus was calculated on a set of 50 records that were double-annotated (blinded) by two different expert annotators, reaching a pairwise agreement of 80.5% on the annotated codes.

Documents were coded with the 2018 version of CIE10 (the Spanish official version of ICD10-Clinical Modification and ICD10-Procedures), guided by the “Manual de Codificación CIE-10-ES Diagnósticos 2018” and the “Manual de Codificación CIE-10-ES Procedimientos 2018” provided by the Spanish Ministry of Health. There are two types of CIE10 codes: Diagnóstico and Procedimiento.

There were differences in the annotation of Diagnóstico and Procedimiento codes:

  • Diagnóstico. The ICD10-Diagnóstico (equivalent to ICD10-CM) terminology is tree-shaped: codes with more characters are more granular. Annotated codes are at least 3 characters long.
  • Procedimiento. The ICD10-Procedimiento (equivalent to ICD10-PCS) terminology is an axial terminology with 7 axes, so every complete Procedimiento code has 7 characters. In a real-world scenario, coders can use context information (other documentation, asking the healthcare professionals involved in the case, etc.) to fill in all 7 characters. This is not possible with clinical case reports, so codes with only 4 characters (the first 4 axes) are also allowed (see the sketch after this list).
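
A minimal sketch of these length rules as a validation helper (hypothetical, not part of the task scripts):

def plausible_code_length(code: str, label: str) -> bool:
    # Length is counted without the dot, e.g. "n17.9" has 4 characters.
    length = len(code.replace(".", ""))
    if label == "DIAGNOSTICO":
        return length >= 3       # Diagnóstico: at least 3 characters
    if label == "PROCEDIMIENTO":
        return length in (4, 7)  # Procedimiento: first 4 axes, or all 7
    return False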


The annotation of the entire set of entity mentions was carried out by terminology experts. Moreover, technical assistance with the annotation interface was offered during the entire corpus development process.
