Description of the Corpus

Annotation guidelines will be available in Zenodo.
DISTEMIST training, test and background set is available at Zenodo.

This page contains the following information:

DISTEMIST corpus format
DISTEMIST corpus general information
DISTEMIST Multilingual corpus

Corpus format

DISTEMIST-entities. Annotations are stored tab-separated file with headers and 6 columns:

filename: document name
mark: identifier mention id
label: mentions type (ENFERMEDAD)
off0: starting position of the mention in the document
off1: ending position of the mention in the document
span: text span

`Figure 1. Example annotation for DISTEMIST-entities`

DISTEMIST-linking. Annotations are stored tab-separated file with headers and 8 columns: filename: document name

mark: identifier mention id
label: mentions type (ENFERMEDAD)
off0: starting position of the mention in the document
off1: ending position of the mention in the document
span: text span
codes: List of Snomed-CT concept codes linked to the mention. If there is more than one code associated with a mention, they will be concatenated by the symbol “+”.
semantic relation: the relationship between the assigned code and the mention. It can be EXACT, when the code corresponds exactly with the mention, or NARROW, when the mention corresponds to a narrower concept than the Snomed-CT code. For instance, the concept “Chorioretinal lacunae” does not exist in Snomed-CT. Then, it is normalized to the Snomed-CT ID 302893000 (“Chorioretinal disorder”).

Figure 1. Example annotation for DISTEMIST-linking

The raw clinical case documents are distributed in plain text in UTF8 encoding, where each clinical case would be stored as a single file.

General information

The DISTEMIST corpus is a collection of 1,000 clinical cases in Spanish from different medical specialties such as cardiology, oncology, otorhinolaryngology, dentistry, pediatrics, primary care, allergology, radiology, psychiatry, ophthalmology, and urology annotated with disease mentions. Each of the mentions in the corpus has been standardized using SNOMED-CT terminology.

All clinical case records derived from various databases were gathered in a first step, preprocessed and the actual clinical case section was extracted removing embedded figure references or citations. These records were classified manually using the MyMiner file labeling online application by a practicing oncologist and revised by a clinical documentalist in order to assure that these records were related to the medical domain and they resembled the kind of structure and content that is relevant to process clinical content. During this process, clinical cases from other fields like psychology, historical forensics, some very particular cases of epidemiology studies, or clinical case series not focused on a single patient/clinical case were removed. The final collection of 1000 clinical cases that make up the corpus had a total of 16504 sentences, with an average of 16.5 sentences per clinical case.

The disease mention annotation was done by clinical experts using the BRAT annotation tool following well-defined annotation guidelines, defined after several cycles of quality control and annotation consistency analysis before annotating the entire dataset.

The corpus was randomly split into two subsets: training and test set. The test set will be used for evaluation purposes of participating teams and will consist of a total of 250 records.