IberEval 2017 - Biomedical Abbreviation Recognition and Resolution (BARR)

Additional resources

To carry out the BARR track, we will release the BARR corpus, a manually labeled collection of Spanish medical abstracts built with a customized version of AnnotateIt and with the Markyt annotation system. The BARR corpus is structured into training, development and test sets, each consisting of a total of 1000 abstracts with their corresponding <Abbreviation, Definition> offset annotations, produced by an annotation team of three domain experts.

For evaluation purposes, participating teams have to recognize short form-long form pairs that co-occur within the same sentence. Each pair consists of:

Short Forms (SFs): shorter terms that denote a longer word or phrase.
Long Forms (LFs): the corresponding definitions, found in the same sentence as the SF.
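
As a rough illustration of the pair recognition task, the sketch below looks for the common "long form (SF)" pattern inside a single sentence and aligns the characters of the short form against the preceding words from right to left, in the spirit of the Schwartz and Hearst approach behind ExtractAbbrev. It is not the official BARR baseline: the function names, the restriction to single-token short forms, the candidate window size and the output format are assumptions made here for illustration, and the example sentence is invented rather than taken from the corpus.

```python
import re


def best_long_form_start(short_form, candidate):
    """Return the index in `candidate` where the long form starts, or None.

    Characters of the short form are matched right to left against the
    candidate text; the first character must match at the start of a word.
    """
    l_idx = len(candidate) - 1
    for s_idx in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            continue  # skip punctuation inside the short form
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None  # ran out of text before matching every character
        l_idx -= 1
    # Extend the match back to the beginning of the word it starts in.
    return candidate.rfind(" ", 0, l_idx + 1) + 1


def find_sf_lf_pairs(sentence):
    """Find "long form (SF)" pairs in one sentence, with character offsets."""
    pairs = []
    # Assumption: short forms are single tokens enclosed in parentheses.
    for match in re.finditer(r"\(([^()\s]+)\)", sentence):
        short_form = match.group(1)
        if not (2 <= len(short_form) <= 10 and any(c.isalpha() for c in short_form)):
            continue
        prefix = sentence[: match.start()].rstrip()
        tokens = list(re.finditer(r"\S+", prefix))
        if not tokens:
            continue
        # Assumed window heuristic: only the last few words are candidates.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        window_start = tokens[-max_words:][0].start()
        start = best_long_form_start(short_form, prefix[window_start:])
        if start is None:
            continue
        lf_start = window_start + start
        pairs.append({
            "short_form": short_form,
            "sf_span": (match.start(1), match.end(1)),
            "long_form": prefix[lf_start:],
            "lf_span": (lf_start, len(prefix)),
        })
    return pairs


# Invented example sentence (not from the BARR corpus).
sentence = "Se estudió la articulación temporomandibular (ATM) en 20 pacientes."
for pair in find_sf_lf_pairs(sentence):
    print(pair)
```

On the invented sentence, the sketch pairs ATM with articulación temporomandibular and reports both spans as character offsets into the sentence. Definitions that do not follow the parenthesis pattern, or whose letters do not align with the short form, would need additional rules or a learned model.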

The BARR background set, a large collection of unlabeled medical abstracts written primarily in Spanish, will be released together with the BARR corpus. Moreover, several lexical resources and a collection of pointers to existing tools will be released as well.

Here's an example of an annotated abstract. A short form (ATM) is annotated together with its long form (articulación temporomandibular). The short form appears again later in the abstract, so it is marked as "Multiple".
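
A minimal sketch of how repeated mentions of a short form could be located by character offset is given below. It reflects one possible reading of the "Multiple" mark, namely that the short form occurs more than once in the abstract. The function names, the "Single" label and the two-sentence text are assumptions made here for illustration; the text is invented and is not the annotated abstract from the corpus.

```python
import re


def find_mentions(text, short_form):
    """Return (start, end) offsets of every standalone occurrence of short_form."""
    # Lookarounds keep the match from firing inside a longer token.
    pattern = re.compile(r"(?<!\w)" + re.escape(short_form) + r"(?!\w)")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def mention_label(text, short_form):
    """Return (label, mentions) for a short form in a text.

    "Multiple" follows the example above; "Single" is a hypothetical label
    used here for a short form that occurs only once.
    """
    mentions = find_mentions(text, short_form)
    if not mentions:
        return None, []
    return ("Multiple" if len(mentions) > 1 else "Single"), mentions


# Invented text (not from the BARR corpus).
text = (
    "La articulación temporomandibular (ATM) es una articulación compleja. "
    "Los trastornos de la ATM son frecuentes."
)
label, mentions = mention_label(text, "ATM")
print(label, mentions)
```

Matching on exact, case-sensitive tokens is a deliberately conservative choice: it avoids confusing a short form with an ordinary lowercase word, at the cost of missing inflected or differently cased variants.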

A list of additional resources for this track is provided below. For this track we will release a manually annotated set of medical articles consisting of a sample set, a training set, a development set and a test set.

Abbreviation/acronym taggers:

Ab3P: a tool for detecting abbreviations and their definitions in text.
BADREX: a GATE plugin that identifies term-abbreviation pairs using dynamic regular expressions.
ExtractAbbrev: a very popular tool to detect abbreviations, developed by Schwartz and Hearst.
MetaMap: a tool for recognizing medical concepts in text. It also detects abbreviations and disambiguates them (UMLS license is required for its use).

Datasets and lexical resources:

Allie: a search service for abbreviations and long forms used in the life sciences.
IATE: InterActive Terminology for Europe.
ITRT: The International Thesaurus of Refugee Terminology.
MediLexicon: a large online database of pharmaceutical and medical abbreviations.
UNESCOTERM: the UNESCO terminology database.

Resources for Spanish biomedical acronyms and abbreviations:

NLP software:

Apache OpenNLP: a machine learning based toolkit for the processing of natural language text.
Freeling: a C++ library providing language analysis functionalities.
IXA Pipes: a modular set of NLP tools which provides easy access to NLP technology.
Stanford CoreNLP: a very popular set of natural language analysis tools.

Biomedical NLP software:

Apache cTAKES: an open-source tool to extract information from medical records.
GENIAtagger: part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text.
MedTagger: a biomedical named entity recognizer and relation extractor.
MetaMap: a tool for recognizing medical concepts in text.

Sentence splitters (most NLP software, including the tools listed above, has an integrated sentence splitter):

CCG Sentence Segmentation Tool: this tool reads plain text and rewrites it with one sentence per line.
GENIA sentence splitter: a sentence splitter optimized for biomedical texts.
splitta: very high accuracy sentence boundary detection (English only).

Machine translation software (recommended to translate titles and abstracts from English to Spanish):

Apertium: an open-source rule-based machine translation platform.
OpenNMT: open-source (MIT) neural machine translation system.
More machine translation tools.

Machine Learning:

Apache Spark MLlib: Apache Spark's scalable machine learning library.
Google TensorFlow: a software library for numerical computation using data flow graphs.
Scikit-learn: simple and efficient tools for data mining and data analysis.
Weka: a collection of machine learning algorithms for data mining tasks.

News

IMPORTANT UPDATE! June 30th, 2017: The LaTeX template and the EasyChair link for paper submission are available at the evaluation site.
IMPORTANT UPDATE! June 21st, 2017: Test set is finally available. Download the Zip file.
IMPORTANT UPDATE! June 20th, 2017: Deadlines extended. The test set will be available tomorrow morning!
IMPORTANT UPDATE! June 16th, 2017: We have finally made the 2 background sets public, with 112728 and 154488 documents respectively. 600 of those documents will be part of the test set. You can find them in the datasets site.
IMPORTANT UPDATE! June 7th, 2017: We updated the results of the baselines and their prediction files. You can find them in the datasets site.
UPDATE! May 31st, 2017: Training subset 2 is available now. Download it from the datasets site or Markyt.
UPDATE! May 25th, 2017: Training subset 1 is available now. Download it from the datasets site or Markyt.
May 5th, 2017: Training set release has been delayed. We will try to release it as soon as possible. Sorry for all the inconvenience.
March 28th, 2017: BARR track registration is now open. Register here. You can download the sample set after registering.
March 10th, 2017: First collection of additional resources for the BARR track published.
March 6th, 2017: BARR track website launched.

Contact

Martin Krallinger, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain

Committee

Martin Krallinger, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Santiago de la Peña, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Ander Intxaurrondo, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Jesús Santamaría, Biological Text Mining Unit (Bio-TeMUC), CNIO, Spain
Jose A. Lopez-Martin, Medical Oncology, Hospital 12 de Octubre, Spain
Alfonso Valencia, Life Sciences & Computational Biology, BSC, Spain
Marta Villegas, Life Sciences & Computational Genomics, BSC, Spain
Anália Lourenço, Next Generation Computer Systems Group, University of Vigo, Spain
Gael Perez, Next Generation Computer Systems Group, University of Vigo, Spain
Martin Perez, Next Generation Computer Systems Group, University of Vigo, Spain