Additional resources
To carry out the BARR track, we will release the BARR corpus, a manually labeled collection of selected Spanish medical abstracts constructed using a customized version of AnnotateIt as well as the Markyt annotation system. The BARR corpus is structured into training, development and test sets, each consisting of a total of 1000 abstracts with their corresponding <Abbreviation, Definition> offset annotations, produced by an annotation team of three domain experts.
For evaluation purposes, participating teams have to recognize short form-long form pairs that co-occur within a sentence. These pairs consist of:
- Short Forms (SFs): a shorter term that denotes a longer word or phrase.
- Long Forms (LFs): the corresponding definition found in the same sentence as the SF.
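To make the task concrete, the following is a minimal sketch of one way to find SF-LF pairs inside a single sentence. It assumes the common "long form (SF)" pattern and uses right-to-left character matching in the spirit of the Schwartz and Hearst approach; it is not a port of ExtractAbbrev and not an official baseline. The function names, the regular expression and the length limits are illustrative assumptions.

```python
import re

# Assumption: a candidate short form is 2-10 non-space characters in parentheses.
PAREN = re.compile(r"\(([^()\s]{2,10})\)")


def long_form_start(short_form, candidate):
    """Return the index in `candidate` where a matching long form starts, or None."""
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        c = short_form[s_index].lower()
        if not c.isalnum():
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while l_index >= 0 and (
            candidate[l_index].lower() != c
            or (s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum())
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend the match back to the beginning of the word it starts in.
    return candidate.rfind(" ", 0, l_index + 1) + 1


def find_pairs(sentence):
    """Return (sf, sf_start, lf, lf_start) tuples with character offsets."""
    pairs = []
    for m in PAREN.finditer(sentence):
        sf = m.group(1)
        # Skip ordinary parentheticals that contain no uppercase letter.
        if not any(ch.isupper() for ch in sf):
            continue
        prefix = sentence[:m.start()].rstrip()
        starts = [t.start() for t in re.finditer(r"\S+", prefix)]
        if not starts:
            continue
        # Limit the search window to a few words before the parenthesis.
        max_words = min(len(sf) + 5, len(sf) * 2)
        cand_start = starts[-max_words] if len(starts) > max_words else starts[0]
        idx = long_form_start(sf, prefix[cand_start:])
        if idx is None:
            continue
        lf_start = cand_start + idx
        pairs.append((sf, m.start(1), prefix[lf_start:], lf_start))
    return pairs


sentence = "La articulación temporomandibular (ATM) es compleja."
for sf, sf_start, lf, lf_start in find_pairs(sentence):
    print(sf, sf_start, lf, lf_start)
```

The sketch compares characters literally, so an accented letter in the long form does not match its unaccented counterpart in the short form. Short forms that precede their definition are not handled either.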
A large collection of unlabeled medical abstracts written primarily in Spanish, the BARR background set, will be released together with the BARR corpus. In addition, several lexical resources and a collection of pointers to existing tools will be released.
In the example annotated abstract, a short form (ATM) is annotated together with its long form (articulación temporomandibular). The short form appears again later in the abstract, so we mark it as "Multiple".
A list of additional resources for this track is provided below. For this track we will release a manually annotated set of medical articles consisting of a sample set, a training set, a development set and a test set.
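As an illustration of what offset annotations for such an abstract might look like, here is a small sketch. The record layout, the `Annotation` class, the `annotate` helper and the "SF"/"LF" label strings are hypothetical and do not reflect the actual corpus file format. The sketch also assumes that the first occurrence of the short form is the defining one and that later occurrences receive the "Multiple" label.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    kind: str   # hypothetical labels: "SF", "LF" or "Multiple"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str   # the annotated surface string


def annotate(abstract, short_form, long_form):
    """Build offset annotations for one known SF-LF pair in `abstract`."""
    annotations = []
    lf_start = abstract.find(long_form)
    if lf_start >= 0:
        annotations.append(
            Annotation("LF", lf_start, lf_start + len(long_form), long_form)
        )
    # Match the short form only as a whole token, not inside longer words.
    pattern = re.compile(r"(?<!\w)" + re.escape(short_form) + r"(?!\w)")
    for i, m in enumerate(pattern.finditer(abstract)):
        # Assumption: the first occurrence defines the pair, later ones repeat it.
        kind = "SF" if i == 0 else "Multiple"
        annotations.append(Annotation(kind, m.start(), m.end(), m.group()))
    return annotations


abstract = (
    "La articulación temporomandibular (ATM) es compleja. "
    "El dolor de la ATM es frecuente."
)
for ann in annotate(abstract, "ATM", "articulación temporomandibular"):
    print(ann)
```

Storing start and end offsets next to the text makes it easy to check that every annotation still points at the intended span of the abstract.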
- Abbreviation/acronym taggers:
- Ab3P: an abbreviation definition detection tool.
- BADREX: a GATE plugin that identifies term-abbreviation pairs using dynamic regular expressions.
- ExtractAbbrev: a very popular tool to detect abbreviations, developed by Schwartz and Hearst.
- MetaMap: a tool for recognizing medical concepts in text. It also detects abbreviations and disambiguates them (a UMLS license is required to use it).
- Datasets and lexical resources:
- Allie: a search service for abbreviations and long forms used in the life sciences.
- IATE: InterActive Terminology for Europe.
- ITRT: The International Thesaurus of Refugee Terminology.
- MediLexicon: world's largest online database of pharmaceutical and medical abbreviations.
- UNESCOTERM: the UNESCO terminology database.
- Resources for Spanish biomedical acronyms and abbreviations:
- Diccionario Babylon - Siglas
- Diccionario de siglas medicas - Sociedad Española de Documentación Médica (SEDOM)
- Glosario siglas médicas
- Siglas médicas (laenfermeria WIKI)
- Wikilengua - Abreviaturas
- NLP software:
- Apache OpenNLP: a machine learning based toolkit for the processing of natural language text.
- Freeling: a C++ library providing language analysis functionalities.
- IXA Pipes: a modular set of NLP tools which provides easy access to NLP technology.
- Stanford CoreNLP: a very popular set of natural language analysis tools.
- Biomedical NLP software:
- Apache cTakes: an open-source tool to extract information from medical records.
- GENIAtagger: part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text.
- MedTagger: a biomedical named entity recognizer and relation extractor.
- MetaMap: a tool for recognizing medical concepts in text.
- Sentence splitters (most NLP software packages, such as those listed above, include an integrated sentence splitter):
- CCG Sentence Segmentation Tool: this tool reads plain text and rewrites it with one sentence per line.
- GENIA sentence splitter: a sentence splitter optimized for biomedical texts.
- splitta: very high accuracy sentence boundary detection (English only).
- Machine translation software (recommended to translate titles and abstracts from English to Spanish):
- Apertium: an open-source rule-based machine translation platform.
- OpenNMT: open-source (MIT) neural machine translation system.
- More machine translation tools.
- Machine Learning:
- Apache Spark MLlib: Apache Spark's scalable machine learning library.
- Google TensorFlow: a software library for numerical computation using data flow graphs.
- Scikit-learn: simple and efficient tools for data mining and data analysis.
- Weka: a collection of machine learning algorithms for data mining tasks.
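Because SF-LF pairs are recognized within sentences, sentence splitting is typically an early processing step. The sketch below is a deliberately naive regular-expression splitter that returns character spans, so that sentence-level results can be mapped back to offsets in the abstract. The `BOUNDARY` pattern and the `split_sentences` function are illustrative assumptions and are not taken from any of the tools listed above.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and then
# an uppercase letter or an inverted Spanish punctuation mark.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÁÉÍÓÚÜÑ¿¡])")


def split_sentences(text):
    """Return (start, end) character spans of the sentences in `text`."""
    spans = []
    start = 0
    for m in BOUNDARY.finditer(text):
        spans.append((start, m.start()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "La ATM duele. Se trata con reposo."
for start, end in split_sentences(text):
    print(start, end, text[start:end])
```

A rule this simple also splits after abbreviations that end in a period, such as titles before a capitalized name. The dedicated sentence splitters listed above are likely to be more reliable on real abstracts.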