Description


There is increasing interest in the automatic text processing of biomedical documents such as the literature, electronic health records, drug labels and medicinal chemistry patents. A problem common to all of these document types is the recognition and resolution of abbreviations, acronyms and symbols, a key issue not only for text indexing algorithms but also for the recognition of biomedical named entities such as genes, chemicals, diseases, procedures and symptoms. This topic has been widely researched for English and is reviewed by Torii et al. (2007) [1]. Strategies used to address this problem include alignment-based approaches described by Schwartz and Hearst (2003) [2], machine learning approaches tested by Chang et al. (2002) [3], and rule-based approaches used by Ao and Takagi (2005) [4]. Lexical resources for biomedical abbreviations in Spanish include, for instance, the “Diccionario de Siglas Médicas” of the Ministerio de Sanidad y Consumo of Spain [5]. For English biomedical texts, several manually annotated corpora have been constructed, namely the MEDSTRACT, Ab3P, BIOADI and Schwartz and Hearst corpora (see Islamaj Doğan et al., 2014 for more details) [6,7].

The proposed Biomedical Abbreviation Recognition and Resolution (BARR) track aims to promote the development and evaluation of biomedical abbreviation identification systems. To this end it provides Gold Standard training, development and test corpora, manually annotated by domain experts with abbreviation-definition pairs found in abstracts of biomedical documents written in Spanish.

The BARR track asks participating teams to test text processing systems that are able to detect explicit occurrences of abbreviation-definition pairs.

An illustrative example case can be found in the following sentence:

Se describe la relación entre diferentes factores de riesgo cardiovasculares (FRCV) y la obesidad a partir de una muestra representativa de la población adulta de Madrid

In this example we consider ‘FRCV’ to be the abbreviation and ‘factores de riesgo cardiovasculares’ to be its definition. The BARR track requires both the recognition of <Abbreviation, Definition> candidate pairs in sentences and the identification of their exact string boundaries.
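To make the notion of exact string boundaries concrete, the following minimal sketch stores each mention together with its character offsets in the sentence. The class names, the `locate` helper and the offset convention (0-based start, exclusive end) are assumptions made for illustration only; they are not the official BARR annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int  # 0-based offset of the first character (assumed convention)
    end: int    # offset one past the last character (assumed convention)
    text: str


@dataclass(frozen=True)
class AbbreviationPair:
    short_form: Mention
    long_form: Mention


def locate(sentence: str, text: str) -> Mention:
    """Return the first occurrence of `text` in `sentence` with its offsets."""
    start = sentence.find(text)
    if start < 0:
        raise ValueError(f"{text!r} does not occur in the sentence")
    return Mention(start, start + len(text), text)


sentence = (
    "Se describe la relación entre diferentes factores de riesgo "
    "cardiovasculares (FRCV) y la obesidad"
)
pair = AbbreviationPair(
    short_form=locate(sentence, "FRCV"),
    long_form=locate(sentence, "factores de riesgo cardiovasculares"),
)

# The offsets must slice back to exactly the annotated strings.
for mention in (pair.short_form, pair.long_form):
    assert sentence[mention.start:mention.end] == mention.text
```

Storing offsets rather than bare strings keeps the annotation unambiguous when the same string occurs more than once in an abstract.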

In line with some previously proposed resources, we refer to an abbreviation as a ShortForm (SF), that is, a shorter term that denotes a longer word or phrase. The definition, or LongForm (LF), is the corresponding longer word or phrase found in the same sentence as the SF.
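As an illustration of how an SF in parentheses might be linked to an LF in the same sentence, the sketch below follows the general idea of the alignment-based approach of Schwartz and Hearst [2]: the characters of the SF are matched right to left against the text preceding the parenthesis. It is a simplified reading of that idea, not the track's baseline or evaluation method. The function names, the regular expression and the plausibility filters are assumptions chosen for this example.

```python
import re
from typing import List, Optional, Tuple


def align_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Match short-form characters right to left inside `candidate`.

    Returns the suffix of `candidate` that covers all matches, or None.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches. The first short-form
        # character must additionally sit at the start of a word.
        while l_index >= 0 and (
            candidate[l_index].lower() != ch
            or (s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum())
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend the match to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Find (short form, long form) pairs of the shape 'long form (SF)'."""
    pairs = []
    for match in re.finditer(r"\(\s*([^()]+?)\s*\)", sentence):
        short_form = match.group(1)
        # Assumed plausibility filter: 2 to 10 characters, starts with a
        # letter or digit, and contains at least one letter.
        if not (
            2 <= len(short_form) <= 10
            and short_form[0].isalnum()
            and any(c.isalpha() for c in short_form)
        ):
            continue
        prefix = sentence[: match.start()].rstrip()
        # Search only a window of words before the parenthesis.
        tokens = list(re.finditer(r"\S+", prefix))
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        window_start = tokens[-max_words].start() if len(tokens) > max_words else 0
        long_form = align_long_form(short_form, prefix[window_start:])
        if long_form and len(long_form) > len(short_form):
            pairs.append((short_form, long_form))
    return pairs


sentence = (
    "Se describe la relación entre diferentes factores de riesgo "
    "cardiovasculares (FRCV) y la obesidad a partir de una muestra "
    "representativa de la población adulta de Madrid"
)
pairs = extract_pairs(sentence)
print(pairs)  # [('FRCV', 'factores de riesgo cardiovasculares')]
```

Note that a plain initial-letter match would fail on this example, because ‘C’ and ‘V’ both come from the single word ‘cardiovasculares’; character-level alignment handles such cases.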

The expected outcome of this task includes strategies and resources for the development and evaluation of abbreviation recognition methods for Spanish biomedical documents. The target audience includes developers of natural language processing tools interested in applying such resources to biomedical documents. Moreover, the generated resources will be valuable for the disambiguation and grounding of abbreviations in health-related document collections and might serve as a sense inventory for medical abbreviations in medical dictionary resources.

To carry out this task we will release the BARR corpus, a manually labeled collection of Spanish medical abstracts constructed using a customized version of AnnotateIt as well as the Markyt annotation system [8]. The corpus comprises a selection of medical abstracts constrained to particular publication types, including clinical case studies, articles, clinical guidelines, protocols and systematic reviews. The BARR corpus is structured into training, development and test sets, each consisting of a total of 1000 abstracts with their corresponding <Abbreviation, Definition> offset annotations, produced by an annotation team of three domain experts. The primary evaluation metric for the BARR track will be the micro-averaged F-measure. A larger background set of 10,000 additional unlabeled Spanish medical abstracts will be released together with the BARR corpus.
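For readers unfamiliar with micro-averaging, the sketch below pools true positives, false positives and false negatives over all abstracts before computing precision, recall and F-measure. It assumes that a predicted pair counts as correct only when all of its offsets match a gold pair exactly, and that the F-measure is the balanced F1; neither detail is specified above. The tuple layout and the document identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: (sf_start, sf_end, lf_start, lf_end) character offsets.
Pair = Tuple[int, int, int, int]


def micro_prf(
    gold: Dict[str, Set[Pair]], pred: Dict[str, Set[Pair]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over all documents."""
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # pairs with exactly matching offsets
        fp += len(p - g)   # predicted pairs not in the gold standard
        fn += len(g - p)   # gold pairs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy data with made-up offsets, for illustration only.
gold = {"doc1": {(10, 14, 0, 8), (30, 33, 20, 28)}, "doc2": {(5, 8, 0, 4)}}
pred = {"doc1": {(10, 14, 0, 8)}, "doc2": {(5, 8, 0, 4), (20, 23, 10, 18)}}

precision, recall, f1 = micro_prf(gold, pred)
# Pooled counts: 2 true positives, 1 false positive, 1 false negative,
# so precision, recall and F1 are all 2/3.
```

Because counts are pooled before the ratios are computed, abstracts with many abbreviation-definition pairs weigh more heavily in the final score than abstracts with few.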

Tentative dates:

  • March 28th, 2017: Release of sample data
  • May 25th, 2017: Release of Training Subset 1
  • May 30th, 2017: Release of Training Subset 2
  • June 21st, 2017: Release of Test Set
  • June 26th, 2017: Submission of participant runs
  • July 1st, 2017: Working notes submission due (short system description, 3-5 pages)
  • July 3rd, 2017: Reviews of working notes sent out to authors
  • July 7th, 2017: Deadline to submit camera-ready revised working notes
  • September 19th, 2017: Workshop at SEPLN 2017

References:

  1. Torii, M.; Hu, Z.-z.; Song, M.; Wu, C. H. & Liu, H. A comparison study on algorithms of detecting long forms for short forms in biomedical text. BMC bioinformatics, 2007, 8 Suppl 9, S5
  2. Schwartz, A. S. & Hearst, M. A. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, 2003, 451-462
  3. Chang, J. T.; Schütze, H. & Altman, R. B. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association: JAMIA, 2002, 9, 612-620
  4. Ao, H. & Takagi, T. ALICE: an algorithm to extract abbreviations from MEDLINE. Journal of the American Medical Informatics Association: JAMIA, 2005, 12, 576-586
  5. Laguna, J. Y. & Cuñat, V. A. Diccionario de Siglas Médicas. Ministerio de Sanidad y Consumo.
  6. Islamaj Doğan, R.; Comeau, D. C.; Yeganova, L. & Wilbur, W. J. Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database: the journal of biological databases and curation, 2014, 2014
  7. Sohn, S.; Comeau, D. C.; Kim, W. & Wilbur, W. J. Abbreviation definition identification based on automatic precision estimates. BMC bioinformatics, 2008, 9, 402
  8. Pérez-Pérez, M.; Pérez-Rodríguez, G.; Rabal, O.; Vazquez, M.; Oyarzabal, J.; Fdez-Riverola, F.; ... & Lourenço, A. The Markyt visualisation, prediction and benchmark platform for chemical and gene entity recognition at BioCreative/CHEMDNER challenge. Database, 2016, baw120