Biomedical Text Mining Unit

OTG Sanidad | Secretaría de Estado para la Agenda Digital

Description

The Biomedical Text Mining Unit focuses on the application and development of biomedical text mining technologies, which are becoming a key tool for the efficient exploitation of information contained in unstructured data repositories, including the scientific literature, electronic health records (EHRs), patents, biobank metadata, clinical trials and social media. The unit has a particular interest in processing health-related clinical documents written in Spanish and the other co-official languages, and in integrating molecular and biological information derived from the literature.

The unit is fully funded through the “Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (PITL)”, in the framework of an agreement (“encomienda”) between the Secretary of State of Telecommunications of the Spanish Ministry of Energy, Tourism and the Digital Agenda (MINETAD) and CNIO.

Aims & Objectives

The strategic goals of the Text Mining Unit are:

  • To design and develop biomedical language-processing resources, with an emphasis on oncology.
  • To provide consultancy and technical advice for language technologies in the biomedical domain.
  • To design requirements and standards for interoperability of biomedical language technologies.
  • To coordinate community assessment and evaluation challenges of biomedical text mining tasks.
  • To promote the uptake of biomedical text mining technologies and relevant standards.

One of the unit's main aims is to provide biomedical text mining and language processing infrastructures that can be maintained efficiently over time and integrated into biomedical analysis platforms comprising data from experimental outcomes of patient-derived information.

Tools & Components

· PoS tagger - Part-of-speech tagger for Spanish medical-domain corpora, based on FreeLing.

· Spanish Medical Abbreviation extractor - the software used to generate the Spanish Medical Abbreviation DataBase (https://github.com/PlanTL/AbreMES-DB). The database is built by detecting abbreviations and their potential definitions when both are explicitly mentioned in the same sentence. The sentences are drawn from the titles and abstracts contained in the metadata of biomedical publications written in Spanish.

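The same-sentence detection described for the Spanish Medical Abbreviation extractor can be illustrated with a small sketch. The code below is not the unit's software, whose implementation is not described on this page. It is a minimal Python illustration, loosely modelled on the widely used Schwartz-Hearst heuristic, that assumes an abbreviation appears in parentheses directly after its definition. The names `SHORT_FORM`, `best_long_form` and `extract_pairs`, the word-window size and the sample sentences are all hypothetical.

```python
import re

# Assumption: a short form is a parenthesised token of 2-10 characters
# that starts with a capital letter, e.g. "(EPOC)".
SHORT_FORM = re.compile(r"\(([A-ZÁÉÍÓÚÑ][\w-]{1,9})\)")


def best_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`.

    The first character of the short form must sit at the start of a word.
    Returns the matched definition, or None if the characters cannot be found.
    """
    l_idx = len(candidate) - 1
    for s_idx in range(len(short) - 1, -1, -1):
        c = short[s_idx].lower()
        if not c.isalnum():
            continue  # ignore hyphens and similar characters in the short form
        while l_idx >= 0 and (
            candidate[l_idx].lower() != c
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    return candidate[l_idx + 1:]


def extract_pairs(text):
    """Return (abbreviation, definition) pairs found within single sentences."""
    pairs = []
    # Naive sentence splitter; real clinical text needs something more robust.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in SHORT_FORM.finditer(sentence):
            short = match.group(1)
            words = sentence[:match.start()].split()
            # Only look at a few words before the parenthesis.
            window = min(len(short) + 5, len(short) * 2)
            long_form = best_long_form(short, " ".join(words[-window:]))
            if long_form:
                pairs.append((short, long_form))
    return pairs


if __name__ == "__main__":
    sample = (
        "La enfermedad pulmonar obstructiva crónica (EPOC) es frecuente. "
        "Se solicitó una tomografía axial computarizada (TAC) urgente."
    )
    for short, long_form in extract_pairs(sample):
        print(short, "=", long_form)
```

Because each sentence is processed separately, a definition is only paired with an abbreviation mentioned in the same sentence. A parenthesised token with no matching preceding words, such as "(ABC)" in "El paciente (ABC) ingresó.", yields no pair.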

Resources & Corpora

· GitHub

Online demos

· Part-of-speech tagger for Spanish medical-domain corpora.

Events

· Infoday about Language Technologies in Healthcare: Infoday on the actions of the Plan to Promote Language Technologies in Healthcare, promoted by the State Secretariat for Digital Advancement and Red.es at SEPLN 2018. The Infoday website is available only in Spanish.

· BARR2: Biomedical Abbreviation Recognition and Resolution 2nd Edition - IberEval workshop at SEPLN 2018

· MultilingualBio: Multilingual Biomedical Text Processing - LREC 2018

· II Hackathon of language technologies: Hackathon to promote the development of prototypes based on Natural Language Processing (NLP), machine translation and conversational systems. The website is available only in Spanish.

· BARR: Biomedical Abbreviation Recognition and Resolution - IberEval workshop at SEPLN 2017

Talks & Presentations

  • Gonzalez-Agirre, A.; Vivanco-Hidalgo, R.M.; Abilleira, S.; Gallofré, M.; Valencia, A.; Villegas, M. and Krallinger, M. Mining Spanish and Catalan Electronic Health Records: Extraction of Information on Diagnosis of Stroke from Discharge Reports. In 3rd European Conference on Translational Bioinformatics: Biomedical Big Data Supporting Precision Medicine, 2018.

People

Publications

    2018

    • Villegas, M., Intxaurrondo, A., Gonzalez-Agirre, A., Marimon, M. and Krallinger, M. The MeSpEN resource for English-Spanish medical machine translation and terminologies: Census of parallel corpora, glossaries and term translations. In Proceedings of the LREC 2018 Workshop “MultilingualBIO: Multilingual Biomedical Text Processing”, pp. 32–39. ISBN: 979-10-95546-03-0, EAN: 9791095546030. [PDF]
    • Intxaurrondo, A., Marimon, M., Gonzalez-Agirre, A., Lopez-Martin, J.A., Rodriguez, H., Santamaria, J., Villegas, M. and Krallinger, M. Finding mentions of abbreviations and their definitions in Spanish Clinical Cases: the BARR2 shared task evaluation results. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval-2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN-2018), pp. 280-289. [PDF]
    • Santamaría, J. and Krallinger, M. Construcción de recursos terminológicos médicos para el español: el sistema de extracción de términos CUTEXT y los repositorios de términos biomédicos. In Procesamiento del Lenguaje Natural, nº 61, pp. 49-56. [PDF]
    • Corvi, J., Fernandez, J.M., Intxaurrondo, A., Krallinger, M., Valencia, A. and Capella-Gutierrez, S. Updating the LimTox Content provider workflow. In XIV Symposium on Bioinformatics (JBI-2018). [Abstract booklet (page 111)]

    2017

    • Villegas, M., de la Peña, S., Intxaurrondo, A., Santamaria, J. and Krallinger, M. Esfuerzos para fomentar la minería de textos de biomedicina más allá del inglés: el plan estratégico nacional español para las tecnologias del lenguaje. In Procesamiento del Lenguaje Natural, nº 59, pp. 141-144. ISBN: 979-10-95546-03-0, EAN: 9791095546030. [PDF]
    • Intxaurrondo, A., Perez-Perez, M., Perez-Rodriguez, G., Lopez-Martin, J.A., Santamaria, J., de la Peña, S., Villegas, M., Ahmad-Akhondi, S., Valencia, A., Lourenço, A. and Krallinger, M. The Biomedical Abbreviation Recognition and Resolution (BARR) Track: Benchmarking, Evaluation and Importance of Abbreviation Recognition Systems Applied to Spanish Biomedical Abstracts. In Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) co-located with 33th Conference of the Spanish Society for Natural Language Processing (SEPLN 2017), pp. 230-246. [PDF]
    • Intxaurrondo, A. and Krallinger, M. CNIO at BARR IberEval 2017: Exploring Three Biomedical Abbreviation Identifiers for Spanish Biomedical Publications. In Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) co-located with 33th Conference of the Spanish Society for Natural Language Processing (SEPLN 2017), pp. 278-285. [PDF]