Data – PharmaCoNER

Annotation Guidelines

The annotation process for the SPACCC chemical and drug corpus was inspired by the annotation schemes and corpora used for the BioCreative CHEMDNER and GPRO tracks. The guidelines used for these tracks were translated into Spanish and adapted to the specificities and needs of more clinically oriented documents by modifying the annotation scheme and rules to cover medical information needs. This adaptation was carried out in collaboration with practicing physicians and medicinal chemistry experts. The adaptation, translation and refinement of the guidelines were carried out on a sample set of the SPACCC corpus and were coupled to an iterative process of annotation consistency analysis through inter-annotator agreement (IAA) calculation, until a high annotation quality in terms of IAA was reached. The final version of the 34-page annotation guidelines can be downloaded from here.

This iterative refinement process required direct interaction between the expert annotators in order to resolve discrepancies. The annotators used a side-by-side visualization of the clinical case document annotations with the discrepancies highlighted, and went over each of them in order to resolve doubts, add rules and add clarifying examples to the guidelines. A common aspect that needed to be clarified was the annotation, in the first sample annotation cycle, of therapeutic application types that did not actually correspond to a chemical entity per se. The final inter-annotator agreement measure obtained for this corpus was calculated on a set of 50 records that were double annotated (blinded) by two different expert annotators, reaching a pairwise agreement of 93% at the exact entity mention comparison level and 76% when the entity concept normalization was also taken into account. Entity normalization was carried out primarily against the SNOMED-CT knowledge base. Note that there is a SNOMED CT version released directly by the Spanish Ministry of Health.
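
The text above does not state the exact formula behind the pairwise agreement figures. As a rough sketch only, assuming agreement is measured as the share of mentions on which both annotators coincide (shared mentions divided by all distinct mentions from either annotator), the two comparison levels could be computed as follows. The `Mention` type and the `pairwise_agreement` function are hypothetical names introduced for illustration.

```python
from typing import Iterable, NamedTuple, Optional


class Mention(NamedTuple):
    """One annotated mention (hypothetical structure for this sketch)."""
    doc_id: str
    start: int
    end: int
    label: str
    code: Optional[str] = None  # concept identifier, if the mention was normalized


def pairwise_agreement(a: Iterable[Mention], b: Iterable[Mention],
                       with_code: bool = False) -> float:
    """Share of distinct mentions on which two annotators agree exactly.

    Assumption: agreement = |A intersect B| / |A union B|. The source does not
    say which measure was actually used.
    """
    def key(m: Mention):
        base = (m.doc_id, m.start, m.end, m.label)
        # When with_code is set, the concept identifier must match as well.
        return base + (m.code,) if with_code else base

    keys_a = {key(m) for m in a}
    keys_b = {key(m) for m in b}
    union = keys_a | keys_b
    if not union:
        return 1.0  # nothing annotated by either annotator: trivially in agreement
    return len(keys_a & keys_b) / len(union)
```

Calling the function with `with_code=False` corresponds to the exact mention comparison level, and with `with_code=True` to the stricter level that also takes normalization into account, which is why the second figure can only be equal or lower.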

The manual annotation of the entire corpus was carried out in a three-step approach. First, an initial annotation process was carried out on an adapted version of the AnnotateIt tool. The resulting annotations were then exported, trailing whitespace was removed, and double annotations of the same string were sent as an alert to the human annotators for revision/correction. Then, the annotations were uploaded into the BRAT annotation tool, which was slightly less efficient for mention labeling in terms of manual annotation speed. The human annotators then performed a final revision of the annotations in BRAT to correct mistakes and to add missing annotation mentions. Finally, the senior annotator did a last round of annotation revision of the entire corpus.
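
The cleanup applied to the exported annotations is not described in detail. The following is a minimal sketch of what such a step might look like, assuming each annotation is an `(id, start, end, text)` tuple and that a "double annotation" means two annotations covering the same character span. Both function names are hypothetical.

```python
from collections import defaultdict


def strip_trailing_whitespace(start, end, text):
    """Drop trailing whitespace from a mention and shrink its end offset to match."""
    stripped = text.rstrip()
    return start, end - (len(text) - len(stripped)), stripped


def find_double_annotations(annotations):
    """Return {(start, end): [ids]} for spans that were annotated more than once.

    Assumption: annotations is an iterable of (id, start, end, text) tuples
    belonging to a single document.
    """
    by_span = defaultdict(list)
    for ann_id, start, end, _text in annotations:
        by_span[(start, end)].append(ann_id)
    # Only spans with two or more annotations need human revision.
    return {span: ids for span, ids in by_span.items() if len(ids) > 1}
```

The spans returned by `find_double_annotations` would be the ones raised as alerts for the annotators, rather than being resolved automatically.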

The annotation of the entire set of entity mentions was carried out by medicinal chemistry experts who, in case of doubts regarding more clinically related concept mentions, directly consulted the practicing physicians collaborating with this annotation project. Moreover, technical assistance with the annotation interface was offered during the entire corpus development process. Figure 3 shows the previous example case together with its corresponding textual annotation in the BRAT format and the clarification of the normalization of one specific entity mention.

Figure 3: An example of SPACCC annotation visualization together with its corresponding annotation in BRAT format, highlighting the normalization of entity mentions to SNOMED-CT, in this case of the antibiotic “Ciprofloxacino”.
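
For readers who want to process such annotations, here is a small sketch of a parser for BRAT standoff text. It relies on the general BRAT conventions (tab-separated text-bound lines starting with `T`, note lines starting with `#`). That the concept identifier is stored in an `AnnotatorNotes` line attached to the mention is an assumption made for this example, and the offsets and code in the usage note are placeholders, not real corpus values. Only contiguous spans are handled.

```python
def parse_brat(ann_text):
    """Parse BRAT standoff text into {id: {label, start, end, text, code}}.

    Assumptions: spans are contiguous (no ';' fragments), and the concept
    identifier, if any, is the body of an AnnotatorNotes line.
    """
    entities = {}
    lines = ann_text.splitlines()
    # First pass: text-bound annotations, e.g. "T1<TAB>LABEL 100 114<TAB>mention".
    for line in lines:
        if line.startswith("T"):
            ann_id, span, mention = line.split("\t", 2)
            label, start, end = span.split(" ")
            entities[ann_id] = {
                "label": label,
                "start": int(start),
                "end": int(end),
                "text": mention,
                "code": None,
            }
    # Second pass: notes, e.g. "#1<TAB>AnnotatorNotes T1<TAB>code".
    for line in lines:
        if line.startswith("#"):
            _note_id, target, note = line.split("\t", 2)
            note_type, ref = target.split(" ")
            if note_type == "AnnotatorNotes" and ref in entities:
                entities[ref]["code"] = note.strip()
    return entities
```

For instance, parsing the two lines `T1<TAB>NORMALIZABLES 100 114<TAB>Ciprofloxacino` and `#1<TAB>AnnotatorNotes T1<TAB>12345678` (with `<TAB>` standing for a tab character and `12345678` a placeholder identifier) yields a single entity `T1` whose `code` field is `"12345678"`.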

Description of the Corpus

For this task we have prepared a manually classified collection of clinical case sections derived from open access Spanish medical publications, named the Spanish Clinical Case Corpus (SPACCC). In a first step, all clinical case records derived from various databases were gathered and preprocessed, and the actual clinical case section was extracted, removing embedded figure references and citations. These records were classified manually using the MyMiner file labeling online application by a practicing oncologist and revised by a clinical documentalist, in order to ensure that the records were related to the medical domain and that they resembled the kind of structure and content that is relevant for processing clinical content. During this process, clinical cases from other fields such as psychology or historical forensics, some very particular cases of epidemiology studies, and clinical case series not focused on a single patient/clinical case were removed. The final collection of 1,000 clinical cases that make up the corpus has a total of 16,504 sentences, with an average of 16.5 sentences per clinical case. The SPACCC corpus contains a total of 396,988 words, with an average of 396.2 words per clinical case. It is worth noting that this kind of narrative shows properties of both the biomedical and medical literature and clinical records. Moreover, the clinical cases were not restricted to a single medical discipline and thus cover a variety of medical topics, including oncology, urology, cardiology, pneumology and infectious diseases, which is key in order to cover a diverse collection of chemicals and drugs. The SPACCC corpus will be distributed in plain text in UTF-8 encoding, with each clinical case stored as a single file.
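
As an illustration of working with this layout (one UTF-8 plain text file per clinical case), the sketch below loads a directory of cases and computes a simple average word count. The `.txt` extension, the function names and the whitespace-based word count are assumptions for this example. The tokenization behind the corpus statistics above is not specified, so the numbers produced this way may differ from the reported ones.

```python
from pathlib import Path


def load_cases(corpus_dir):
    """Read every *.txt file in corpus_dir as UTF-8, keyed by file name without extension."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(corpus_dir).glob("*.txt"))
    }


def average_word_count(cases):
    """Average number of whitespace-separated tokens per clinical case."""
    if not cases:
        return 0.0
    return sum(len(text.split()) for text in cases.values()) / len(cases)
```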

Figure 1. An example of SPACCC annotation visualized using the BRAT annotation interface.

The total number of labeled mentions in the latest version of the SPACCC corpus is 6,931. They correspond to a more granular annotation scheme covering four different mention types, proposed directly through consensus by both clinicians and medicinal chemistry basic researchers. Figure 1 shows a screenshot of a clinical case annotated using the BRAT interface. The overall annotation statistics are:

Entity type 1: 4,426 mentions of chemicals that can be manually normalized to a unique concept identifier (primarily SNOMED-CT), having the tag “NORMALIZABLES”.
Entity type 2: 55 mentions of chemicals that could not be normalized manually to a unique concept identifier, having the label “NO_NORMALIZABLES”.
Entity type 3: 2,291 mentions of proteins and genes following an adaptation of the BioCreative GPRO track annotation guidelines, having the label “PROTEINAS”. This class also includes peptides, peptide hormones and antibodies.
Entity type 4: 159 cases of general substance class mentions of clinical and biomedical relevance, including certain pharmaceutical formulations, general treatments, chemotherapy programs, vaccines and a predefined set of general substances (e.g.: Estragón, Silimarina, Bromelaína, Melanina, Vaselina, Lanolina, Alcohol, Tabaco, Marihuana, Cannabis, Opio and Gluten). Mentions of this class were labeled as “UNCLEAR” and will not be part of the entities evaluated by this track, but serve as additional annotations of medical relevance.
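
The counts above can be cross-checked against the reported total, and the exclusion of “UNCLEAR” mentions from evaluation can be expressed as a simple filter. This is only a sketch: the variable and function names are hypothetical, and mentions are assumed to be tuples whose first element is the label.

```python
# Mention counts per label, as listed above.
MENTION_COUNTS = {
    "NORMALIZABLES": 4426,
    "NO_NORMALIZABLES": 55,
    "PROTEINAS": 2291,
    "UNCLEAR": 159,
}

# Labels that take part in the evaluation (everything except UNCLEAR).
EVALUATED_LABELS = {label for label in MENTION_COUNTS if label != "UNCLEAR"}


def evaluated_only(mentions):
    """Keep only mentions whose label (first tuple element) is evaluated."""
    return [m for m in mentions if m[0] in EVALUATED_LABELS]


total = sum(MENTION_COUNTS.values())                                  # 6931
evaluated_total = sum(MENTION_COUNTS[l] for l in EVALUATED_LABELS)    # 6772
```

The four counts add up to the 6,931 mentions reported for the corpus, of which 6,772 belong to the three evaluated entity types.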

Currently the annotation format used is based on BRAT, but distributing this corpus in other formats such as PubAnnotation, BioC or JSON will also be explored.

The entire SPACCC corpus will be randomly sampled into three subsets: the training, development and test sets. The training and development sets will comprise 350 records each, while the test set, which will be used to evaluate participating teams, will consist of 300 records. Together with the test set release we plan to add an additional collection of 2,000 documents (background set) to ensure that participating teams cannot make manual corrections and that their systems are able to scale to larger data collections.

Datasets

The PharmaCoNER corpus has been randomly sampled into three subsets: the training, development and test sets. The training set contains 500 clinical cases, and the development and test sets contain 250 clinical cases each.
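
The sampling procedure actually used is not documented here. As a minimal sketch of a reproducible random split with these sizes, assuming each clinical case has a unique identifier, one could write the following. The function name, the seed value and the identifier format are hypothetical.

```python
import random


def split_corpus(doc_ids, sizes=(500, 250, 250), seed=0):
    """Randomly partition doc_ids into training, development and test lists.

    Assumption: the three sizes must add up to the number of documents.
    Sorting before shuffling makes the result depend only on the seed,
    not on the order in which the identifiers were supplied.
    """
    doc_ids = sorted(doc_ids)
    if sum(sizes) != len(doc_ids):
        raise ValueError("split sizes must add up to the number of documents")
    random.Random(seed).shuffle(doc_ids)
    n_train, n_dev, _n_test = sizes
    train = doc_ids[:n_train]
    dev = doc_ids[n_train:n_train + n_dev]
    test = doc_ids[n_train + n_dev:]
    return train, dev, test
```

Because the three lists are consecutive slices of one shuffled list, no clinical case can end up in more than one subset.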

Sample set

The sample set is composed of 15 clinical cases extracted from the training set. This sample set is also included in the evaluation script (see Resources). Download the sample set from here.

Train set

The train set is composed of 500 clinical cases. Download the train set from here.

Development set

The Development set is composed of 250 clinical cases. Download the development set from here.

Background set

The background set is composed of 2,751 clinical cases. It is distributed in plain text format. Download the background set from here.

Test set with Gold Standard annotations

The test set with Gold Standard annotations is composed of 250 clinical cases. Download the test set with Gold Standard annotations from here.