IberEval 2017 - Biomedical Abbreviation Recognition and Resolution (BARR)

Description

There is increasing interest in the automatic text processing of biomedical documents such as the literature, electronic health records, drug labels and medicinal chemistry patents. A common problem across all of these document types is the recognition and resolution of abbreviations, acronyms and symbols, a key issue not only for text indexing algorithms but also for the recognition of biomedical named entities such as genes, chemicals, diseases, procedures and symptoms. This topic has been widely researched for English and is reviewed by Torii et al. (2007) [1]. Strategies used to address this problem include alignment-based approaches described by Schwartz and Hearst (2003) [2], machine learning approaches tested by Chang et al. (2002) [3], and rule-based approaches used by Ao and Takagi (2005) [4]. Lexical resources for biomedical abbreviations in Spanish include, for instance, the “Diccionario de Siglas Médicas” of the Ministerio de Sanidad y Consumo of Spain [5]. For English biomedical texts, several manually annotated corpora have been constructed, namely the MEDSTRACT, Ab3P, BIOADI and Schwartz and Hearst corpora (see Islamaj Doğan et al., 2014 for more details) [6,7].

The proposed Biomedical Abbreviation Recognition and Resolution (BARR) track aims to promote the development and evaluation of biomedical abbreviation identification systems by providing Gold Standard training, development and test corpora, manually annotated by domain experts with abbreviation-definition pairs within abstracts of biomedical documents written in Spanish.

The BARR track asks participating teams to test text processing systems that are able to detect explicit occurrences of abbreviation-definition pairs.

An illustrative example case can be found in the following sentence:

Se describe la relación entre diferentes factores de riesgo cardiovasculares (FRCV) y la obesidad a partir de una muestra representativa de la población adulta de Madrid

In this example we consider ‘FRCV’ to be the abbreviation and ‘factores de riesgo cardiovasculares’ to be its definition. The BARR track requires both the recognition of <Abbreviation, Definition> candidate pairs from sentences and the identification of their exact string boundaries.

In line with some previously proposed resources, we refer to an abbreviation as a ShortForm (SF), that is, a shorter term that denotes a longer word or phrase. The definition, or LongForm (LF), is the corresponding expansion found in the same sentence as the SF.
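As a rough illustration of what detecting an SF-LF pair with exact boundaries involves, the sketch below handles only the simple pattern "long form (SHORT FORM)". It uses a right-to-left character matching heuristic in the spirit of the alignment-based approach of Schwartz and Hearst [2]. The function names, the output dictionary keys and the regular expression are assumptions made for this example; this is not the BARR baseline, nor the track's annotation format.

```python
import re


def best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    Returns the start index of the long form inside `candidate`,
    or None when the characters cannot be aligned.
    """
    l_idx = len(candidate) - 1
    for s_idx in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    # Extend to the beginning of the word where the last match happened.
    return candidate.rfind(" ", 0, l_idx + 1) + 1


def find_pairs(sentence):
    """Return SF-LF pairs with character offsets relative to `sentence`."""
    pairs = []
    # Assumed SF pattern: 2-10 non-space characters inside parentheses.
    for m in re.finditer(r"\(\s*([^\s()]{2,10})\s*\)", sentence):
        short_form = m.group(1)
        prefix = sentence[:m.start()].rstrip()
        tokens = list(re.finditer(r"\S+", prefix))
        if not tokens:
            continue
        # Search window: at most min(|SF| + 5, 2 * |SF|) preceding words.
        max_words = min(len(short_form) + 5, 2 * len(short_form))
        window_start = tokens[-max_words:][0].start()
        rel = best_long_form(short_form, prefix[window_start:])
        if rel is None:
            continue
        lf_start, lf_end = window_start + rel, len(prefix)
        pairs.append({
            "short_form": short_form,
            "sf_start": m.start(1),
            "sf_end": m.end(1),
            "long_form": sentence[lf_start:lf_end],
            "lf_start": lf_start,
            "lf_end": lf_end,
        })
    return pairs


example = ("Se describe la relación entre diferentes factores de riesgo "
           "cardiovasculares (FRCV) y la obesidad")
for pair in find_pairs(example):
    print(pair["short_form"], "->", pair["long_form"])
```

On the example sentence, this heuristic pairs 'FRCV' with 'factores de riesgo cardiovasculares'. Note that the 'V' aligns with a letter inside 'cardiovasculares' rather than with a word initial, which is why a pure first-letter match would not be enough here.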

The expected outcome of this task includes strategies and resources for the development and evaluation of abbreviation recognition methods in Spanish biomedical documents. The target audience includes developers of natural language processing tools interested in applying such resources to biomedical documents. Moreover, the generated resources will be valuable for the disambiguation and grounding of abbreviations in health-related document collections and might serve as a sense inventory for medical abbreviations and medical dictionary resources.

To carry out this task we will release the BARR corpus, a manually labeled collection of Spanish medical abstracts constructed using a customized version of AnnotateIt as well as the Markyt annotation system [8]. This corpus comprises a selection of medical abstracts restricted to particular publication types, including clinical case studies, articles, clinical guidelines, protocols and systematic reviews. The BARR corpus is structured into training, development and test sets, each consisting of a total of 1000 abstracts with their corresponding <Abbreviation, Definition> offset annotations, produced by an annotation team of three domain experts. The primary evaluation metric for the BARR track will be the micro-averaged F-measure. A larger background set of 10,000 additional unlabeled Spanish medical abstracts will be released together with the BARR corpus.
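For readers unfamiliar with micro-averaging, the following minimal sketch pools true positives, false positives and false negatives over all abstracts before computing precision, recall and F-measure. It assumes that an annotation counts as correct only when its offsets match the Gold Standard exactly, and that annotations are given as (sf_start, sf_end, lf_start, lf_end) tuples per document. Both the matching criterion and the data layout are assumptions for illustration; the official evaluation script may differ.

```python
def micro_f1(gold, predicted):
    """Micro-averaged precision, recall and F-measure.

    `gold` and `predicted` map a document id to a set of annotations
    (hypothetical layout: tuples of SF and LF character offsets).
    """
    tp = fp = fn = 0
    # Pool the counts over every document before computing the ratios.
    for doc_id in set(gold) | set(predicted):
        g = gold.get(doc_id, set())
        p = predicted.get(doc_id, set())
        tp += len(g & p)   # exact matches
        fp += len(p - g)   # predicted but not in the gold standard
        fn += len(g - p)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: offsets are made up for the example.
gold = {
    "doc1": {(77, 81, 41, 75), (120, 123, 100, 118)},
    "doc2": {(30, 33, 10, 28)},
}
predicted = {
    "doc1": {(77, 81, 41, 75)},
    "doc2": {(30, 33, 10, 28), (60, 62, 45, 58)},
}
precision, recall, f1 = micro_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

Because the counts are pooled first, abstracts with many abbreviation-definition pairs weigh more in the final score than abstracts with few, unlike a macro-average that would score each abstract separately and then take the mean.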

Tentative dates:

March 28th, 2017: Release of sample data
May 25th, 2017: Release of Training Subset 1
May 30th, 2017: Release of Training Subset 2
June 21st, 2017: Release of Test Set
June 26th, 2017: Submission of participant runs
July 1st, 2017: Working notes submission due (short system description 3-5 pages)
July 3rd, 2017: Reviews of Working notes sent out to authors
July 7th, 2017: Deadline to submit Camera ready revised Working notes
September 19th, 2017: Workshop at SEPLN 2017

References:

[1] Torii, M.; Hu, Z.-Z.; Song, M.; Wu, C. H. & Liu, H. A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC Bioinformatics, 2007, 8 Suppl 9, S5
[2] Schwartz, A. S. & Hearst, M. A. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, 2003, 451-462
[3] Chang, J. T.; Schütze, H. & Altman, R. B. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association (JAMIA), 2002, 9, 612-620
[4] Ao, H. & Takagi, T. ALICE: an algorithm to extract abbreviations from MEDLINE. Journal of the American Medical Informatics Association (JAMIA), 2005, 12, 576-586
[5] Laguna, J. Y.; Cuñat, V. A. Diccionario de Siglas Médicas. Ministerio de Sanidad y Consumo.
[6] Islamaj Doğan, R.; Comeau, D. C.; Yeganova, L. & Wilbur, W. J. Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database: the journal of biological databases and curation, 2014, 2014
[7] Sohn, S.; Comeau, D. C.; Kim, W. & Wilbur, W. J. Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics, 2008, 9, 402
[8] Pérez-Pérez, M.; Pérez-Rodríguez, G.; Rabal, O.; Vazquez, M.; Oyarzabal, J.; Fdez-Riverola, F.; ... & Lourenço, A. The Markyt visualisation, prediction and benchmark platform for chemical and gene entity recognition at BioCreative/CHEMDNER challenge. Database, 2016, baw120

News

IMPORTANT UPDATE! June 30th, 2017: LaTeX template and Easychair for the submission paper are available at the evaluation site.
IMPORTANT UPDATE! June 21st, 2017: Test set is finally available. Download the Zip file.
IMPORTANT UPDATE! June 20th, 2017: Deadlines extended. The test will be available tomorrow morning!!
IMPORTANT UPDATE! June 16th, 2017: We finally made the 2 background sets public, with a total of 112728 and 154488 documents. 600 of those documents will be part of the test set. You can find them in the datasets site.
IMPORTANT UPDATE! June 7th, 2017: We updated the results of the baselines and their predictions files. You can find them in the datasets site.
UPDATE! May 31st, 2017: Training subset 2 is available now. Download it from the datasets site or Markyt.
UPDATE! May 25th, 2017: Training subset 1 is available now. Download it from the datasets site or Markyt.
May 5th, 2017: Training set release has been delayed. We will try to release it as soon as possible. Sorry for all the inconvenience.
March 28th, 2017: BARR track registration is now open. Register here. You can download the sample set after registering.
March 10th, 2017: BARR track first collection additional resources published.
March 6th, 2017: BARR track website launched.

Contact

Martin Krallinger, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain

Committee

Martin Krallinger, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Santiago de la Peña, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Ander Intxaurrondo, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Jesús Santamaría, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Jose A. Lopez-Martin, Medical Oncology, Hospital 12 de Octubre, Spain
Alfonso Valencia, Life Sciences & Computational Biology, BSC, Spain
Marta Villegas, Life Sciences & Computational Genomics, BSC, Spain
Anália Lourenço, Next Generation Computer Systems Group, University of Vigo, Spain
Gael Perez, Next Generation Computer Systems Group, University of Vigo, Spain
Martin Perez, Next Generation Computer Systems Group, University of Vigo, Spain