Annotation guidelines will be available on Zenodo.
DISTEMIST training, test and background sets (including the multilingual corpus and cross-mappings) are available at Zenodo.
DISTEMIST Multilingual corpus
We have generated annotated training and validation sets, normalized to Snomed-CT, in 6 languages: English, Portuguese, Catalan, Italian, French, and Romanian. The process was as follows:
- The text files were translated with a neural machine translation system.
- The annotations were translated with the same neural machine translation system.
- The translated annotations were transferred to the translated text files using an annotation transfer technology.
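The three steps above can be illustrated with a minimal sketch of the last one, annotation transfer. The transfer technology actually used for the corpus is not described here, so this example substitutes plain case-insensitive string matching as a stand-in. The names `Span` and `transfer_annotations`, the example sentence, and the codes `CODE_A` and `CODE_B` are all hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    text: str    # surface form as it appears in the translated text
    code: str    # terminology code carried over from the source annotation


def transfer_annotations(
    translated_text: str,
    translated_mentions: List[Tuple[str, str]],
) -> List[Span]:
    """Locate each translated mention in the translated text.

    `translated_mentions` holds (mention_text, code) pairs in document order.
    Assumption: exact, case-insensitive matching is good enough for the
    illustration; a real system would need fuzzier alignment.
    """
    spans: List[Span] = []
    cursor = 0
    for mention, code in translated_mentions:
        pattern = re.compile(re.escape(mention), re.IGNORECASE)
        # Search forward from the previous match first, to keep document order.
        match = pattern.search(translated_text, cursor)
        if match is None:
            # Fall back to searching the whole text.
            match = pattern.search(translated_text)
        if match is None:
            # The mention could not be located; skip it rather than guess.
            continue
        spans.append(Span(match.start(), match.end(), match.group(0), code))
        cursor = match.end()
    return spans


text = "The patient has type 2 diabetes and chronic kidney disease."
mentions = [("type 2 diabetes", "CODE_A"), ("Chronic kidney disease", "CODE_B")]
spans = transfer_annotations(text, mentions)
for span in spans:
    print(span.start, span.end, span.text, span.code)
```

Mentions that cannot be found are dropped instead of being forced onto a nearby span, which keeps the transferred offsets trustworthy at the cost of some recall.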
If you want to visualize the multilingual resources, check out this Brat server: https://temu.bsc.es/mDistemist/#/translations/
For instance, you can compare the parallel annotations in English vs. French, or in Spanish (the gold standard) vs. Italian.
DISTEMIST cross-mappings
The DISTEMIST Gold Standard contains the mentions mapped to Snomed-CT.
The DISTEMIST cross-mappings files include the same entities as DISTEMIST-linking, mapped to Snomed-CT, MeSH, ICD-10, HPO, and OMIM. The original mappings, to Snomed-CT, were produced manually; the mappings to the other terminologies were derived through the UMLS Metathesaurus.
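As a rough illustration of mapping through a metathesaurus, the sketch below joins codes from different terminologies on a shared concept identifier. It assumes the input has already been reduced to (concept ID, vocabulary, code) triples; the function name `cross_map`, the vocabulary labels, and every identifier in the toy data are placeholders, not real UMLS content or the procedure used to build the released files.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Row = Tuple[str, str, str]  # (concept_id, vocabulary, code)


def cross_map(
    rows: Iterable[Row],
    source_vocab: str,
    target_vocabs: Set[str],
) -> Dict[str, Dict[str, List[str]]]:
    """Map each source-vocabulary code to target codes sharing a concept ID."""
    concepts_by_source: Dict[str, Set[str]] = defaultdict(set)
    targets_by_concept: Dict[str, Dict[str, Set[str]]] = defaultdict(
        lambda: defaultdict(set)
    )
    for concept_id, vocab, code in rows:
        if vocab == source_vocab:
            concepts_by_source[code].add(concept_id)
        elif vocab in target_vocabs:
            targets_by_concept[concept_id][vocab].add(code)

    result: Dict[str, Dict[str, List[str]]] = {}
    for source_code, concept_ids in concepts_by_source.items():
        merged: Dict[str, Set[str]] = defaultdict(set)
        for concept_id in concept_ids:
            # .get avoids creating empty entries for concepts with no targets.
            for vocab, codes in targets_by_concept.get(concept_id, {}).items():
                merged[vocab] |= codes
        result[source_code] = {v: sorted(c) for v, c in merged.items()}
    return result


# Toy rows with placeholder identifiers only.
rows = [
    ("CUI_1", "SNOMED", "S1"),
    ("CUI_1", "MESH", "M1"),
    ("CUI_1", "ICD10", "I1"),
    ("CUI_2", "SNOMED", "S2"),
    ("CUI_2", "HPO", "H1"),
]
mappings = cross_map(rows, "SNOMED", {"MESH", "ICD10", "HPO", "OMIM"})
print(mappings)
```

A source code with no counterpart in a given terminology simply has no entry for that vocabulary, so incomplete coverage across terminologies is visible in the output.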