Corpus Description

About the texts

The MEDDOPLACE corpus is a collection of 1,000 clinical case reports in Spanish (635,785 tokens), manually selected for their relevance to the task.

Clinical case reports are a genre of medical writing that describes a patient’s medical history, symptoms, diagnosis, and treatment in detail. They are usually written by healthcare providers, such as physicians, nurses, or other medical professionals, to document and share information about a specific patient’s condition. Often published in peer-reviewed medical journals, these reports contribute to the advancement of medical knowledge and to improved patient care. They are also a valuable resource for Natural Language Processing in the clinical domain, since they offer rich medical information as unstructured text that resembles real hospital records.

The clinical case reports in the corpus come from different clinical specialties, such as psychiatry, neurology, travel medicine, infectious diseases, cardiology, occupational medicine, and oncology. This variety helps the corpus cover a more diverse and comprehensive range of locations and related information found in medical records.

About the annotations

The MEDDOPLACE corpus was manually annotated by clinical experts following the MEDDOPLACE guidelines, which were created de novo by clinical and linguistic experts. The guidelines were based on a literature review of several location-related corpora and were refined over several cycles of quality control and annotation consistency analysis. Only then was the entire dataset annotated, using brat. Once annotation was finished, the corpus underwent a post-processing step to maximize consistency.

In total, the corpus contains almost 10,000 annotations distributed across 10 labels (GPE_NOM, GPE_GEN, GEO_NOM, GEO_GEN, FAC_NOM, FAC_GEN, DEPARTAMENTO, TRANSPORTE, COMUNIDAD, IDIOMA). Almost all entities in the corpus are normalized: GPE_NOM and GEO_NOM to GeoNames, FAC_NOM to PlusCodes, and the rest to SNOMED CT. Additionally, all location annotations have been classified into five clinically relevant classes.
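
The label-to-resource assignment described above can be written down as a small lookup. The following Python sketch is only an illustration: the function name normalization_resource is hypothetical, and the mapping simply restates the paragraph above (it does not reflect the fact that a few entities are left unnormalized).

```python
# The 10 labels listed in the corpus description.
LABELS = [
    "GPE_NOM", "GPE_GEN", "GEO_NOM", "GEO_GEN", "FAC_NOM",
    "FAC_GEN", "DEPARTAMENTO", "TRANSPORTE", "COMUNIDAD", "IDIOMA",
]

# Labels with a dedicated resource; every other label goes to SNOMED CT.
_SPECIFIC_RESOURCES = {
    "GPE_NOM": "GeoNames",
    "GEO_NOM": "GeoNames",
    "FAC_NOM": "PlusCodes",
}


def normalization_resource(label: str) -> str:
    """Return the resource a label is normalized to (hypothetical helper)."""
    if label not in LABELS:
        raise ValueError(f"Unknown MEDDOPLACE label: {label}")
    return _SPECIFIC_RESOURCES.get(label, "SNOMED CT")
```

For instance, normalization_resource("FAC_NOM") returns "PlusCodes", while normalization_resource("IDIOMA") falls back to "SNOMED CT".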

More information on the annotation, normalization and classification (and their inter-annotator agreement) is available in the Annotation Guidelines page.

About the format

MEDDOPLACE is offered in three formats: brat, TSV, and JSON. For more details, see the README file that accompanies the Gold Standard data (available on the Downloads page).
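
As a rough illustration of the brat format, the sketch below reads text-bound annotations ("T" lines) from brat standoff content in Python. It assumes the standard brat standoff layout (ID, then label and character offsets, then the covered text, separated by tabs); the sample line, the Entity type, and the function name are invented for this example, and the README remains the authoritative description of the released files.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str                    # e.g. "T1"
    label: str                     # e.g. "GPE_NOM"
    spans: List[Tuple[int, int]]   # character offsets (start, end)
    text: str                      # annotated text as stored in the .ann file


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound annotations from brat standoff content."""
    entities = []
    for line in ann_text.splitlines():
        # Only "T" lines are entities; notes ("#"), relations ("R"), etc. are skipped.
        if not line.startswith("T"):
            continue
        ann_id, label_and_offsets, text = line.split("\t", 2)
        label, offsets = label_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ";".
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Invented sample content, not taken from the corpus.
sample = "T1\tGPE_NOM 23 29\tMadrid\n#1\tAnnotatorNotes T1\texample note\n"
entities = parse_brat_entities(sample)
```

Here entities holds a single Entity for "Madrid"; the note line is ignored because it does not start with "T".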