The annotation process of the SPACCC chemical and drug corpus was inspired by previous annotation schemes and corpora used for the BioCreative CHEMDNER and GPRO tracks, translating the guidelines used for these tracks into Spanish and adapting them to the specificities and needs of more clinically oriented documents by modifying the annotation scheme and rules to cover medical information needs. This adaptation was carried out in collaboration with practicing physicians and medicinal chemistry experts. The adaptation, translation and refinement of the guidelines was carried out on a sample set of the SPACCC corpus and connected to an iterative process of annotation consistency analysis through inter-annotator agreement (IAA) calculation until a high annotation quality on terms of IAA was reached. A link to the final version of the used 34 pages annotation guidelines can be downloaded from here.

This iterative refinement process required direct interaction between the expert annotators in order to resolve discrepancies, using a side-by-side visualization of clinical case document annotations with the high lightened discrepancies and going over each of them in order to solve doubts, add rules and add clarifying examples to the guidelines. A common aspect that needed to be clarified was the annotation in the first sample annotation cycle of therapeutic application types that actually did not correspond to a chemical entity per se. The final, inter-annotator agreement measure obtained for this corpus was calculated on a set of 50 records that were double annotated (blinded) by two different expert annotators, reaching a pairwise agreement of 93% on the exact entity mention comparison level and 76% agreement when also the entity concept normalization was taken into account. Entity normalization was carried out primarily against the SNOMED-CT knowledgebase. Note that there is SNOMED CT version directly released by the Spanish Ministry of Health.

The manual annotation of the entire corpus was carried out in a three-step approach. First, an initial annotation process was carried out on an adapted version of the AnnotateIt tool. The resulting annotations were then exported, trailing whitespaces were removed and double annotations of the same string were send as an alert to the human annotators for revision/correction. Then, the annotations were uploaded into the BRAT annotation tool, which was slightly less efficient for mention labeling in terms of manual annotation speed. The human annotators performed then a final revision of the annotation in BRAT, to correct mistakes and to add missing annotation mentions. Finally, the senior annotator did a last round of annotation revision of the entire corpus.

The annotation of the entire set of entity mentions was carried out by medicinal chemistry experts, who in case of doubts regarding more clinically related concept mentions consulted the directly the practicing physicians collaborating with this annotation project. Moreover, technical assistance in terms of the annotation interface was offered during the entire corpus development process. Figure 3 shows the previous example case together with its corresponding textual annotation in the BRAT format and the clarification of the normalization of one specific entity mention.

Figure 3: An example of SPACCC annotation visualization together with its corresponding annotation in BRAT format, highlighting the normalization of entity mentions to SNOMED-CT, in this case of the antibiotic “Ciprofloxacino”.