Corpus Description

About the texts

The MedProcNER corpus is a collection of 1,000 clinical case reports in Spanish annotated with clinical procedure mentions and normalized to SNOMED CT. The texts belong to the SPACCC corpus and are the same ones used in DisTEMIST, making the annotations complementary for medical entity recognition. For this reason, an alternative name for MedProcNER is ProcTEMIST.

Clinical case reports are a textual genre in medicine that describes a patient’s medical history, symptoms, diagnosis, and treatment in detail. The case reports included in the SPACCC corpus were manually selected by a clinician for their similarity to real clinical texts in structure and content. They cover medical specialties such as cardiology, oncology, otorhinolaryngology, dentistry, pediatrics, primary care, allergology, radiology, psychiatry, ophthalmology, and urology. The final collection of 1,000 clinical cases that make up the corpus has a total of 16,504 sentences, an average of 16.5 sentences per clinical case.

About the annotations

MedProcNER was manually annotated by clinical experts using the brat annotation tool. The annotators followed well-defined annotation guidelines, which were refined through several cycles of quality control and annotation consistency analysis before the entire dataset was annotated.

The corpus contains almost 10,000 annotations of clinical procedures, most of which are normalized to SNOMED CT.

More information on the annotation and normalization (and their inter-annotator agreement) is available on the Annotation Guidelines page.

About the format

MedProcNER will be made available in different formats (brat, TSV, JSON). More information is coming soon.
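
Until the official format details are published, the following is a minimal sketch of how text-bound annotations in the general brat standoff format (.ann files) could be read in Python. It assumes only the standard brat layout (an ID starting with "T", then a label with character offsets, then the mention text, separated by tabs). The label "PROCEDURE", the sample line, and the names TextBound and parse_brat_textbound are hypothetical and are not taken from the MedProcNER release.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    """One text-bound brat annotation (a line whose ID starts with 'T')."""
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_brat_textbound(ann_content: str) -> List[TextBound]:
    """Parse text-bound annotations from the contents of a brat .ann file.

    Lines that are not text-bound annotations (notes, relations, etc.)
    are skipped.
    """
    annotations = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue
        # Layout: ID <TAB> LABEL START END[;START END...] <TAB> TEXT
        ann_id, label_and_offsets, text = line.split("\t", 2)
        label, offsets = label_and_offsets.split(" ", 1)
        spans = []
        # Discontinuous mentions list several fragments separated by ';'
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        annotations.append(TextBound(ann_id, label, spans, text))
    return annotations


# Hypothetical sample content; the label and offsets are illustrative only.
sample = (
    "T1\tPROCEDURE 10 21\tradiografía\n"
    "#1\tAnnotatorNotes T1\texample note\n"
)
parsed = parse_brat_textbound(sample)
for ann in parsed:
    print(ann.ann_id, ann.label, ann.spans, ann.text)
```

The sketch keeps offsets as a list of spans so that discontinuous mentions, if the corpus contains any, are not silently flattened. How the SNOMED CT codes are stored in each format is not covered here, since that is not described above.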