For this task we have prepared a manually classified collection of clinical case sections derived from Open access Spanish medical publications, named the Spanish Clinical Case Corpus (SPACCC). All clinical case records derived from various databases were gathered in a first step, preprocessed and the actual clinical case section was extracted removing embedded figure references or citations. These records where classified manually using the MyMiner file labeling online application by a practicing oncologist and revised by a clinical documentalist in order to assure that these records were related to the medical domain and they resembled the kind of structure and content that is relevant to process clinical content. During this process, clinical cases from other fields like psychology, historical forensics, some very particular cases of epidemiology studies or clinical case series not focused on a single patient/clinical case were removed. The final collection of 1000 clinical cases that make up the corpus had a total of 16504 sentences, with an average of 16.5 sentences per clinical case. The SPACCC corpus contains a total of 396,988 words, with an average of 396.2 words per clinical case. It is noteworthy to say that this kind of narrative shows properties of both, the biomedical and medical literature as well as clinical records. Moreover the clinical cases were not restricted to a single medical discipline, and thus cover a variety of medical topics, including oncology, urology, cardiology, pneumology or infections diseases, which is key in order to cover a diverse collection of chemicals and drugs. The SPACCC corpus will be distributed in plain text in UTF8 encoding, where each clinical case would be stored as a single file.
The total number of labeled mentions of the last version of the SPACCC corpus is of 6,931. They do correspond to a more granular annotation scheme covering four different mention types directly proposed through consensus by both clinician and medicinal chemistry basic researches. Figure 1 shows a screenshot of a clinical case annotated using the BRAT interface. The overall annotation statistics are:
Entity type 1: 4,426 mentions of chemicals that can be manually normalized to a unique concept identifier (primarily SNOMED-CT), having the tag “NORMALIZABLES”.
Entity type 2: 55 mentions of chemicals that could not be normalized manually to a unique concept identifier, having the label “NO_NORMALIZABLES”.
Entity type 3: 2,291 mentions of proteins and genes following an adaptation of the BioCreative GPRO track annotation guidelines, having the label “PROTEINAS”. This class includes also peptides, peptide hormones and antibodies.
Entity type 4: 159 cases of general substance class mentions of clinical and biomedical relevance, including certain pharmaceutical formulations, general treatments, chemotherapy programs, vaccines and a predefined set of general substances (e.g.: Estragón, Silimarina, Bromelaína, Melanina, Vaselina, Lanolina, Alcohol, Tabaco, Marihuana, Cannabis, Opio and Gluten). Mentions of this class were labeled as “UNCLEAR” and will not be part of the entities evaluated by this track, but serve as additional annotations of medical relevance.
Currently the annotation format used is based on BRAT, but the distribution of this corpus also in other formats like PubAnnotation, BioC, or JSON will also be explored.
The entire SPACCC corpus will be randomly sampled into three subsets, the training, development and test set. The training and development set will comprise a total of 350 records each, while the test set, which will be used for evaluation purposes of participating teams will consist of a total of 300 records. Together with the test set release we plan to add an additional collection of 2,000 documents (background set) to make sure that participating teams will not be able to do manual corrections and also that these systems are able to scale to larger data collections.