PharmaCoNER: Pharmacological Substances, Compounds and proteins and Named Entity Recognition track
SEAD – Plan TL Sponsoring the PharmaCoNER Task Awards for track winners:
There is a prize for both sub-tracks: 1,000€ to each
About the task
Efficient access to mentions of drugs, medications and chemical entities is a pressing need shared by biomedical researchers and clinicians.
The critical importance of chemical and drug name recognition has motivated several shared tasks in the past, such as the CHEMDNER tracks or the i2b2 medication challenge. However, most BioNLP and clinical NLP research is currently done on English documents, and only a few tasks have been carried out on non-English texts or as multilingual tracks. Nonetheless, a considerable amount of biomedically relevant content is published in languages other than English, and clinical texts in particular are, with a few exceptions, written entirely in the native language of each country.
Following the outline of previous chemical and drug NER efforts, in particular the BioCreative CHEMDNER tracks, we organize the first task on chemical and drug mention recognition from Spanish medical texts, namely from a corpus of Spanish clinical case studies. This task will thus address the automatic extraction of chemical, drug and gene/protein mentions from clinical case studies written in Spanish. The main aim is to promote the development of named entity recognition tools of practical relevance, that is, tools that recognize chemical and drug mentions in non-English content, while determining the current state of the art, identifying challenges, and comparing the strategies and results with those published for English data.
For this task we have prepared a manually classified collection of clinical case sections derived from open access Spanish medical publications, named the Spanish Clinical Case Corpus (SPACCC). The corpus contains a total of 1000 clinical cases (396,988 words). Notably, this kind of narrative shows properties of both the biomedical and medical literature and clinical records.
The annotation of the entire set of entity mentions was carried out by medicinal chemistry experts and it includes the following four entity types:
– “NORMALIZABLES”: mentions of chemicals that can be manually normalized to a unique concept identifier (primarily SNOMED-CT).
– “NO_NORMALIZABLES”: mentions of chemicals that could not be normalized manually to a unique concept identifier.
– “PROTEINAS”: mentions of proteins and genes following an adaptation of the BioCreative GPRO track annotation guidelines. This class also includes peptides, peptide hormones and antibodies.
– “UNCLEAR”: general substance class mentions of clinical and biomedical relevance, including certain pharmaceutical formulations, general treatments, chemotherapy programs, vaccines and a predefined set of general substances (e.g., Estragón, Silimarina, Bromelaína, Melanina, Vaselina, Lanolina, Alcohol, Tabaco, Marihuana, Cannabis, Opio and Gluten). Mentions of this class will not be part of the entities evaluated by this track, but serve as additional annotations of medical relevance.
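The four entity types and the exclusion of “UNCLEAR” from evaluation can be sketched in a few lines of Python. This is only an illustration: the `Mention` class, the `find_mention` helper, the example sentence and the offset convention (start inclusive, end exclusive) are assumptions made here, not part of the corpus specification.

```python
from dataclasses import dataclass

# Labels listed in the task description; UNCLEAR is annotated but not evaluated.
ALL_LABELS = {"NORMALIZABLES", "NO_NORMALIZABLES", "PROTEINAS", "UNCLEAR"}
EVALUATED_LABELS = ALL_LABELS - {"UNCLEAR"}


@dataclass(frozen=True)
class Mention:
    """Hypothetical in-memory representation of one annotated mention."""
    start: int  # character offset, inclusive (assumed convention)
    end: int    # character offset, exclusive (assumed convention)
    label: str
    text: str


def find_mention(document: str, surface: str, label: str) -> Mention:
    """Build a Mention by locating the first occurrence of `surface`."""
    if label not in ALL_LABELS:
        raise ValueError(f"unknown label: {label}")
    start = document.index(surface)  # raises ValueError if absent
    return Mention(start, start + len(surface), label, surface)


def evaluated_only(mentions):
    """Keep only the mentions whose label is scored by the track."""
    return [m for m in mentions if m.label in EVALUATED_LABELS]


# Invented example sentence, for illustration only.
doc = "Se administró paracetamol y se detectó albúmina baja; consume tabaco."
mentions = [
    find_mention(doc, "paracetamol", "NORMALIZABLES"),
    find_mention(doc, "albúmina", "PROTEINAS"),
    find_mention(doc, "tabaco", "UNCLEAR"),
]
scored = evaluated_only(mentions)
print([(m.text, m.label) for m in scored])
```

Filtering before scoring keeps the “UNCLEAR” annotations available as additional information without letting them influence the evaluated results.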
Evaluation of automatic predictions for this task will have two different scenarios or tracks:
1) Track 1: NER offset and entity classification.
2) Track 2: Concept indexing.
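To make the first scenario concrete, the following Python sketch compares predicted mentions against gold annotations. The description above does not state the scoring metric, so strict matching (same document, same offsets, same label) and precision, recall and F1 are assumptions here, as are the tuple layout, the document identifier and the example offsets.

```python
def precision_recall_f1(gold, predicted):
    """Score two collections of hashable items by exact set overlap."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Assumed item layout for Track 1: (document id, start, end, label).
gold = {
    ("case_001", 14, 25, "NORMALIZABLES"),
    ("case_001", 39, 47, "PROTEINAS"),
}
predicted = {
    ("case_001", 14, 25, "NORMALIZABLES"),  # exact match
    ("case_001", 39, 46, "PROTEINAS"),      # end offset off by one: no match
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because the function only compares sets, the same sketch could in principle score the second scenario by passing items such as (document id, concept identifier) pairs, although that layout is also an assumption.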