The datasets can be downloaded from Zenodo.

A complete list of relevant Diagnóstico and Procedimiento codes for the task is available at Zenodo.

Annotation guidelines may be found here.


The CodiEsp corpus was annotated in collaboration with terminology experts. Annotators followed an iterative training process until a high inter-annotator agreement (IAA) was reached. The final IAA for this corpus was calculated on a set of 50 records that were independently double annotated (blinded) by two expert annotators, reaching a pairwise agreement of 80.5% on the annotated codes.

Documents were coded with the 2018 version of CIE10 (the official Spanish version of ICD10-Clinical Modification and ICD10-Procedures), guided by the “Manual de Codificación CIE-10-ES Diagnósticos 2018” and the “Manual de Codificación CIE-10-ES Procedimientos 2018” provided by the Spanish Ministry of Health. There are two types of CIE10 codes: Diagnóstico and Procedimiento.

Diagnóstico and Procedimiento codes were annotated differently:

  • Diagnóstico. The ICD10-Diagnóstico terminology (equivalent to ICD10-CM) is tree-shaped: codes with more characters are more granular. Annotated codes are at least 3 characters long.
  • Procedimiento. The ICD10-Procedimiento terminology (equivalent to ICD10-PCS) is an axial terminology with 7 axes, so every complete Procedimiento code has 7 characters. In a real-world scenario, coders can draw on context (other documentation, the healthcare professionals involved in the case, etc.) to obtain enough information to fill all 7 characters. Clinical case reports do not offer that context. Therefore, codes with only 4 characters (the first 4 axes) are allowed.
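
The length rules above can be turned into a quick sanity check on annotation files. The following is a minimal sketch, not part of the official CodiEsp tooling: the function names and the `kind` labels are hypothetical, it assumes codes may be written with a dot that should be ignored when counting characters, and it reads the Procedimiento rule as "exactly 4 or exactly 7 characters". It checks only the shape of a code, not whether the code exists in CIE10.

```python
def normalize(code: str) -> str:
    """Lowercase a code and drop dots (assumed formatting, e.g. 'N28.0' -> 'n280')."""
    return code.strip().lower().replace(".", "")


def has_valid_length(code: str, kind: str) -> bool:
    """Check the length constraints described above for one code.

    kind is a hypothetical label: 'diagnostico' or 'procedimiento'.
    """
    n = len(normalize(code))
    if kind == "diagnostico":
        # Annotated Diagnóstico codes have at least 3 characters.
        return n >= 3
    if kind == "procedimiento":
        # Assumption: either all 7 axes, or only the first 4 axes.
        return n in (4, 7)
    raise ValueError(f"unknown code type: {kind!r}")


def is_more_granular(code: str, parent: str) -> bool:
    """In a tree-shaped terminology, a longer code that extends a shorter
    one as a prefix is treated here as its more granular descendant."""
    c, p = normalize(code), normalize(parent)
    return len(c) > len(p) and c.startswith(p)


if __name__ == "__main__":
    print(has_valid_length("N28.0", "diagnostico"))
    print(has_valid_length("bn20", "procedimiento"))
    print(is_more_granular("n28.0", "n28"))
```

A check like this catches truncated or malformed codes early; validating against the actual list of relevant codes published on Zenodo would still be needed to confirm that a code is real.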

The entire set of entity mentions was annotated by terminology experts, who received technical support with the annotation interface throughout the corpus development process.
