The Text Mining Unit (TEMU) at the Barcelona Supercomputing Center (BSC) focuses on the development and application of biomedical text mining technologies, which are becoming a key tool for efficiently exploiting the information contained in unstructured data repositories, including the scientific literature, electronic health records (EHRs), patents, biobank metadata, clinical trials and social media. The unit has a particular interest in processing health-related clinical documents written in Spanish and other co-official languages, and in integrating molecular and biological information derived from the literature.
The unit is fully funded through the “Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL)”, in the framework of an agreement (“encomienda”) between the Secretary of State of Telecommunications of the Spanish Ministry of Energy, Tourism and the Digital Agenda (MINETAD) and BSC.
The strategic goals of the Text Mining Unit are:
One of the unit's main aims is to provide biomedical text mining and language processing infrastructures that can be maintained efficiently over time and integrated into biomedical analysis platforms comprising data from experimental outcomes of patient-derived information.
This page provides links to various types of resources, developed both by TEMU and externally.
SPACCC_SPLIT - A collection of 1,000 clinical cases in Spanish in which sentence boundaries are marked up.
SPACCC_TOKEN - A collection of 1,000 clinical cases in Spanish in which tokens are marked up.
SPACCC_POS - A collection of 1,000 clinical cases in Spanish annotated with part-of-speech tags.
Spanish Medical Abbreviation DataBase - The database is built automatically by detecting abbreviations and their potential definitions when both are explicitly mentioned in the same sentence. The abbreviations are extracted from the metadata (titles and abstracts) of biomedical publications written in Spanish. The publications come from SciELO, IBECS and PubMed.
Bilingual medical glossaries - Glossaries for various language pairs, generated from free online medical glossaries and dictionaries made by professional translators.
Translation models for neural machine translation
A set of translation models required to run the Neural Machine Translation (NMT) system for the Biomedical Domain. The available translation directions are: English to Spanish, Spanish to English, English to Portuguese, Portuguese to English, Spanish to Portuguese and Portuguese to Spanish.
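For the Spanish Medical Abbreviation DataBase listed above, the idea of pairing an abbreviation with a definition mentioned in the same sentence can be illustrated with a short Python sketch. This is not TEMU's actual pipeline, which this page does not describe. It is a simplified initial-letter heuristic, and the regular expression, the function names and the search window of twice the abbreviation length are all assumptions made for illustration.

```python
import re

# Assumption: an abbreviation candidate is 2 to 10 upper-case letters in parentheses.
ABBR_IN_PARENS = re.compile(r"\(([A-ZÁÉÍÓÚÑ]{2,10})\)")


def _match_initials(abbr, words):
    """Walk backwards over the words before the abbreviation, matching one
    word initial per abbreviation letter. Words whose initial does not match
    (for example "de" or "la") are skipped but kept in the definition."""
    start = max(0, len(words) - 2 * len(abbr))  # hypothetical search window
    i = len(abbr) - 1
    j = len(words) - 1
    while i >= 0 and j >= start:
        if words[j][0].lower() == abbr[i].lower():
            i -= 1
        j -= 1
    if i >= 0:
        return None  # not every letter could be matched
    return " ".join(words[j + 1:])


def find_abbreviations(sentence):
    """Return (abbreviation, candidate definition) pairs found in one sentence."""
    pairs = []
    for match in ABBR_IN_PARENS.finditer(sentence):
        abbr = match.group(1)
        words = re.findall(r"\w+", sentence[:match.start()])
        definition = _match_initials(abbr, words)
        if definition:
            pairs.append((abbr, definition))
    return pairs


if __name__ == "__main__":
    print(find_abbreviations(
        "Se realizó una tomografía axial computarizada (TAC) de urgencia."
    ))
```

A heuristic this simple will miss abbreviations that are not built from word initials and can propose wrong definitions, which fits the source's description of the extracted definitions as "potential".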