Healthcare professionals do face serious time constraints when carrying out their daily workload. Moreover, a considerable fraction of medical and clinical terms, including names of diseases, symptoms or even clinical tests, correspond to long complex phrases. Consequently, abbreviations and acronyms, hereafter referred to as short forms, are widely encountered in clinical reports. This is true even for cases of highly ambiguous short forms that could potentially have many different senses.
Clinical short forms, as a result of their underlying semantic complexity, are problematic. Correct interpretation of short forms is not only difficult for patients, but often represent a challenge even for medical professionals and health documentation experts.
In order to work well, clinical language technologies and information retrieval systems do require prior identification and resolution medical short forms. In practice, short forms are used in clinical texts to denote all kinds of key concepts, including diseases and adverse effects, symptoms, medications, treatments or temporal and anatomical expressions or routes of administration of drugs.
This topic was widely researched in English and is reviewed by Torii et al. (2007) [1]. Among the used strategies to address this problem are alignment-based approaches described by Schwartz and Hearst (2003) [2], machine learning approaches tested by Chang et al. (2002) [3] , or rule-based approaches used by Ao and Takagi (2005) [4].
Lexical resources for biomedical abbreviations in Spanish include for instance the “Diccionario de Siglas Médicas” of the Ministerio de Sanidad y Consumo of Spain [5]. In case of English biomedical texts, several manually annotated corpora have been constructed, i.e. the MEDSTRACT, Ab3P, BOADI and Schwartz and Hearst corpora (see Islamaj Doğan et al., 2014 for more details) [6,7].
To address computationally the medical abbreviation resolution problem in Spanish, there is a pressing need to access clinical corpora annotated with short form information and of short form clinical sense inventories.
The proposed track, the Second Biomedical Abbreviation Recognition and Resolution (BARR2) track relies on the experience and success of the previous BARR track [8], by extending annotations of short forms to clinical text and clinical case studies.
While the previous BARR track was focused on biomedical literature in Spanish, the BARR2 track has the aim to promote the development and evaluation of clinical abbreviation identification systems by providing Gold Standard training and test corpora manually annotated by domain experts with abbreviation-definition pairs within abstracts of clinical texts and clinical case studies written in Spanish.
The BARR2 track will be structured into two subtasks, namely:
An illustrative example case for sub-track 1 can be found in the following sentence:
In this example we consider 'FRVC' to be the abbreviation, and 'factores de riesgo cardiovasculares' is its definition. The BARR2 track requires both the recognition of <Abbreviation,Definition> candidate pairs from sentences, and identification of exact string boundaries.
An illustrative example case for sub-track 2 can be found in the following sentence:
In this example we consider 'CPRE' to be the abbreviation, the actual definition or long form is not provided in the document and would correspond to 'colangio-pancreatografía retrógrada endoscópica.
In line with some of the previously proposed resources we refer to an abbreviation as -a ShortForm (SF)- that is a shorter term that denotes a longer word or phrase. On the other hand, the definition -the LongForm (LF)- refers to the corresponding definition found in the same sentence as the SF.
The expected outcome of this task includes strategies and resources for the development and evaluation of abbreviation recognition methods in Spanish biomedical documents. The expected target audience of this task includes developers of natural language processing tools interested in the application of such resources to biomedical documents. Moreover the generated resources will result in valuable resources for the disambiguation and grounding of abbreviations in health related document collections and might serve as a sense inventory for medical abbreviations medical dictionary resources.
In order to carry out this task we will release the BARR2 corpus, consisting in a manually labeled collection of Spanish medical abstracts constructed using a customized version of AnnotateIt, BRAT as well as using the Markyt annotation system [9]. This corpus comprises a selection of medical abstracts, clinical case studies and clinical record sentences. The BARR2 corpus will be structured into a training and a test set, each manually labeled with their corresponding <Abbreviation,Definition> offset annotations by a team of domain experts. The primary evaluation metric used for the BARR2 track will consist in precision, recall, y f-score of the predictions against manual gold standard. A larger background set of additional automatically labeled Spanish clinical case reports will be released together with the BARR2 corpus.
Ander Intxaurrondo
ander.intxaurrondo[AT]bsc.es