Annotation guidelines will be available on Zenodo.

DISTEMIST training, test and background sets (including the multilingual corpus and cross-mappings) are available at Zenodo.

DISTEMIST Multilingual corpus

We have generated annotated training and validation sets, normalized to Snomed-CT, in six languages: English, Portuguese, Catalan, Italian, French, and Romanian. The process was as follows:

  1. The text files were translated with a neural machine translation system.
  2. The annotations were translated with the same neural machine translation system.
  3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
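
The transfer step (step 3) can be illustrated with a minimal sketch. The annotation transfer technology actually used is not described here, so the code below substitutes a plain case-insensitive string search as a stand-in. The names `Annotation` and `transfer_annotations`, the sample sentence, and the codes are hypothetical and serve only as an illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class Annotation(NamedTuple):
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    text: str   # mention exactly as it appears in the translated text
    code: str   # concept code carried over from the source annotation


def transfer_annotations(
    translated_text: str,
    translated_mentions: List[Tuple[str, str]],
) -> List[Annotation]:
    """Locate each translated mention in the translated text.

    Simplifying assumption: a translated mention appears (almost) verbatim
    in the translated document. A real transfer system has to cope with
    inflection, reordering and paraphrase; this sketch does not.
    """
    transferred = []
    cursor = 0  # search forward first so repeated mentions get distinct spans
    for mention, code in translated_mentions:
        pattern = re.compile(re.escape(mention), re.IGNORECASE)
        match = pattern.search(translated_text, cursor)
        if match is None:
            # Word order may differ between languages: retry from the start.
            match = pattern.search(translated_text)
        if match is None:
            continue  # mention could not be transferred; skip it
        transferred.append(
            Annotation(match.start(), match.end(), match.group(0), code)
        )
        cursor = match.end()
    return transferred


# Hypothetical example data (not taken from the corpus).
text_en = "The patient has type 2 diabetes and chronic kidney disease."
mentions_en = [
    ("type 2 diabetes", "CODE-A"),
    ("Chronic kidney disease", "CODE-B"),
    ("pneumonia", "CODE-C"),  # absent from the text, so it is dropped
]
for ann in transfer_annotations(text_en, mentions_en):
    print(ann)
```

Mentions that cannot be located are silently dropped in this sketch; in practice they would probably be logged for manual review.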

If you want to visualize the multilingual resources, check out this Brat server: https://temu.bsc.es/mDistemist/#/translations/
For instance, you can compare the parallel annotations in English vs. French, or in Spanish (the gold standard) vs. Italian.

Figure: Multilingual annotated and normalized corpus process overview.
Figure: Gold Standard (Spanish) vs English annotations visualized with Brat.

DISTEMIST cross-mappings

The DISTEMIST Gold Standard contains the mentions mapped to Snomed-CT.

The DISTEMIST cross-mappings files include the same entities as DISTEMIST-linking, mapped to Snomed-CT, MeSH, ICD-10, HPO, and OMIM. The original mappings to Snomed-CT were made manually. The mappings to the other terminologies were derived through the UMLS Metathesaurus.
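
As a rough illustration of how a mapping through the UMLS Metathesaurus can work, the sketch below groups codes by their shared UMLS concept identifier (CUI) and then links each Snomed-CT code to the codes of other terminologies under the same CUI. This is an assumption about the general approach, not the exact procedure used for DISTEMIST. The rows, CUIs, codes, and the source labels in `TOY_ROWS` are made-up placeholders; in a real UMLS release, such triples would typically be read from the Metathesaurus concept file.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical (CUI, source vocabulary, code) triples; all values are
# placeholders and do not correspond to real concepts.
TOY_ROWS = [
    ("C0000001", "SNOMEDCT", "111"),
    ("C0000001", "MESH", "D000001"),
    ("C0000001", "ICD10", "X00"),
    ("C0000002", "SNOMEDCT", "222"),
    ("C0000002", "HPO", "HP:0000002"),
    ("C0000003", "OMIM", "999999"),  # no Snomed-CT code: never a source
]


def build_cross_map(
    rows: Iterable[Tuple[str, str, str]],
    source_vocab: str = "SNOMEDCT",
) -> Dict[str, Dict[str, Set[str]]]:
    """Map each source-vocabulary code to codes in other vocabularies.

    Two codes are linked when they are attached to the same CUI.
    """
    # First pass: CUI -> vocabulary -> set of codes.
    codes_by_cui = defaultdict(lambda: defaultdict(set))
    for cui, vocab, code in rows:
        codes_by_cui[cui][vocab].add(code)

    # Second pass: pivot on the source vocabulary.
    mapping: Dict[str, Dict[str, Set[str]]] = {}
    for by_vocab in codes_by_cui.values():
        for source_code in by_vocab.get(source_vocab, ()):
            targets = mapping.setdefault(source_code, {})
            for vocab, codes in by_vocab.items():
                if vocab != source_vocab:
                    targets.setdefault(vocab, set()).update(codes)
    return mapping


cross_map = build_cross_map(TOY_ROWS)
for snomed_code, targets in sorted(cross_map.items()):
    print(snomed_code, {vocab: sorted(codes) for vocab, codes in targets.items()})
```

A mapping built this way can be one-to-many, since several codes of a target terminology may share a CUI with a single Snomed-CT code, and some Snomed-CT codes may have no counterpart at all.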