Annotation guidelines will be available on Zenodo.

The DISTEMIST training, test and background sets are available on Zenodo.


The training dataset consists of 750 annotated clinical cases. The annotations can be accessed via a tab-separated file with the following fields:

  • filename: document name
  • mark: mention identifier
  • label: mention type (ENFERMEDAD)
  • off0: starting position of the mention in the document
  • off1: ending position of the mention in the document
  • span: text span
  • codes: list of Snomed-CT concept codes linked to the mention. If a mention has more than one code, the codes are joined with the “+” symbol.
  • semantic relation: the relationship between the assigned code and the mention. It is EXACT when the code corresponds exactly to the mention, and NARROW when the mention denotes a narrower concept than the assigned Snomed-CT code. For instance, the concept “Chorioretinal lacunae” does not exist in Snomed-CT, so the mention is normalized to the broader Snomed-CT ID 302893000 (“Chorioretinal disorder”) and marked as NARROW.
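
The field list above can be read with a short Python sketch such as the one below. It is only an illustration under stated assumptions: the columns are taken to appear in the order listed, the file is taken to have a header row, and the sample rows (file name, mention ids, offsets, and the codes 111111 and 222222) are made up for the example. Only the code 302893000 comes from the description above.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    filename: str
    mark: str
    label: str
    off0: int
    off1: int
    span: str
    codes: List[str]
    semantic_relation: str


def parse_annotations(handle, has_header=True):
    """Read mentions from a tab-separated stream, one mention per row.

    Assumption: columns follow the order of the field list
    (filename, mark, label, off0, off1, span, codes, semantic relation).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row
    mentions = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        filename, mark, label, off0, off1, span, codes, relation = row[:8]
        mentions.append(
            Mention(
                filename=filename,
                mark=mark,
                label=label,
                off0=int(off0),
                off1=int(off1),
                span=span,
                codes=codes.split("+"),  # several codes are joined with "+"
                semantic_relation=relation,
            )
        )
    return mentions


# Hypothetical sample rows; header names and values are illustrative only.
SAMPLE_TSV = (
    "filename\tmark\tlabel\toff0\toff1\tspan\tcodes\tsemantic_rel\n"
    "case_0001\tT1\tENFERMEDAD\t10\t31\tChorioretinal lacunae\t302893000\tNARROW\n"
    "case_0001\tT2\tENFERMEDAD\t40\t52\texample span\t111111+222222\tEXACT\n"
)

mentions = parse_annotations(io.StringIO(SAMPLE_TSV))
for m in mentions:
    print(m.filename, m.mark, m.codes, m.semantic_relation)
```

Splitting the codes field on “+” yields a one-element list for single-code mentions, so downstream code can treat both cases uniformly.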

In addition, txt files will be provided for each clinical case, so that the context of each mention can be accessed and automatic systems can be trained.
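
As a rough sketch of how the txt files and the offsets might be combined, the Python snippet below checks that a mention's offsets recover its span from the document text. It assumes that off0 and off1 are character offsets with off1 exclusive, that the files are UTF-8 encoded, and that each text file is named after the filename field plus ".txt". None of these conventions is stated above, so they should be verified against the released data. The helper names and the sample sentence are hypothetical.

```python
from pathlib import Path


def load_case_text(txt_dir, filename):
    """Read one clinical case. Assumes the file is <txt_dir>/<filename>.txt in UTF-8."""
    return (Path(txt_dir) / f"{filename}.txt").read_text(encoding="utf-8")


def span_matches(text, off0, off1, span):
    """Return True if text[off0:off1] equals the annotated span.

    Assumption: character offsets, with off1 pointing one past the last character.
    """
    return text[off0:off1] == span


# Made-up sentence standing in for the contents of a case file.
text = "Paciente con diabetes mellitus tipo 2."
span = "diabetes mellitus"
off0 = text.index(span)
off1 = off0 + len(span)

print(off0, off1, span_matches(text, off0, off1, span))
```

If the check fails for many mentions in the real data, the offset convention assumed here (for example, an exclusive end position) is probably wrong and should be adjusted.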


The test dataset consists of 200 unannotated clinical cases.

It is published together with a larger collection of 2800 background clinical cases, to discourage manual correction of predictions. You must make predictions for all 3000 clinical cases, but you will be evaluated only on the 200 that belong to the test set.
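
Conceptually, this setup means that predictions for background cases are ignored at scoring time. The Python sketch below illustrates that filtering step only. It is not the official evaluation script, and the function name, the dictionary layout of a prediction, and the file names are hypothetical.

```python
def restrict_to_test(predictions, test_filenames):
    """Keep only the predictions whose document belongs to the test set."""
    test_filenames = set(test_filenames)
    return [p for p in predictions if p["filename"] in test_filenames]


# Made-up predictions over a mix of test and background documents.
predictions = [
    {"filename": "case_a", "off0": 0, "off1": 5, "codes": "111111"},
    {"filename": "case_b", "off0": 3, "off1": 9, "codes": "222222"},
    {"filename": "case_a", "off0": 12, "off1": 20, "codes": "333333"},
]

# Here only "case_a" is treated as a test document; "case_b" is background.
scored = restrict_to_test(predictions, {"case_a"})
print(len(scored), "of", len(predictions), "predictions would be scored")
```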