Corpus Description

About the texts

The SympTEMIST corpus is a collection of 1,000 clinical case reports in Spanish, annotated with mentions of symptoms, signs, and findings and normalized to SNOMED CT. The texts belong to the SPACCC corpus and are the same ones used in DisTEMIST and MedProcNER, which makes the annotations of the three corpora complementary for medical entity recognition.

Clinical case reports are a textual genre in medicine that describes a patient’s medical history, symptoms, diagnosis, and treatment in detail. The case reports included in the SPACCC corpus were manually selected by a clinician for their similarity to real clinical texts in terms of structure and content. Texts from different medical specialties are included, such as cardiology, oncology, otorhinolaryngology, dentistry, pediatrics, primary care, allergology, radiology, psychiatry, ophthalmology, and urology. The final collection of 1,000 clinical cases that make up the corpus contains a total of 16,504 sentences, with an average of 16.5 sentences per clinical case.

About the annotations

SympTEMIST was manually annotated by multiple clinical experts using the brat annotation tool. The annotators followed guidelines that were refined through several cycles of quality control and annotation consistency analysis before the entire dataset was annotated.

The corpus contains a total of 12,196 annotations of symptoms, signs, and findings, most of which are normalized to SNOMED CT.

More information on the annotation and normalization (and their inter-annotator agreement) is available in the Annotation Guidelines page.

About the format

SympTEMIST is made available in different formats (brat, .TSV). For more information on brat’s format, please visit: https://brat.nlplab.org/standoff.html. The .TSV file columns are explained in the README file attached to the corpus.
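
As a rough illustration of the brat standoff format, the following Python sketch reads the text-bound annotation lines (those whose ID starts with "T") from the content of an .ann file. It assumes the general layout described in the brat documentation: an ID, then a type followed by character offsets, then the covered text, separated by tabs, with discontinuous spans separated by semicolons. The function name, the sample line, and the label used in it are hypothetical and are not taken from the corpus; the README remains the reference for what the released files actually contain.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    ann_id: str                    # annotation ID, e.g. "T1"
    label: str                     # entity type as written in the .ann file
    spans: list[tuple[int, int]]   # one (start, end) offset pair per fragment
    text: str                      # the text covered by the annotation


def parse_text_bound(ann_content: str) -> list[TextBound]:
    """Collect text-bound (T) annotations from brat standoff content."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip notes (#), relations (R), and other line types
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        spans = []
        # Discontinuous annotations list several fragments separated by ";".
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(TextBound(ann_id, label, spans, text))
    return entities


# Hypothetical example content, not taken from the corpus.
sample = "T1\tSINTOMA 10 16\tfiebre\n#1\tAnnotatorNotes T1\tillustrative note\n"
entities = parse_text_bound(sample)
```

Offsets in the standoff format are character positions in the accompanying .txt file, so slicing that text with a span's start and end should give back the annotated mention for contiguous annotations.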