MeSpEn - English-Spanish Medical Machine Translation and Terminologies

How to cite:

Marta Villegas, Ander Intxaurrondo, Aitor Gonzalez-Agirre, Montserrat Marimon, Martin Krallinger. (2018). The MeSpEN Resource for English-Spanish Medical Machine Translation and Terminologies: Census of Parallel Corpora, Glossaries and Term Translations. In LREC MultilingualBIO: Multilingual Biomedical Text Processing. ELRA.

Link to article.

Biomedical and clinical literature

IBECS

IBECS (Spanish Bibliographical Index in Health Sciences) is a bibliographical database that collects scientific journals covering multiple fields in health sciences. It is maintained by the Spanish National Health Sciences Library (BNCS), at the Carlos III Health Institute.

This corpus contains titles and abstracts from 168,198 records in English and Spanish. The metadata of each record is written in Dublin Core format. The original XML file that IBECS supplies for each record is included as well.

MeSpEn_Parallel-Corpora at ZENODO
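
As a rough illustration of how a Dublin Core record with English and Spanish fields might be read, the sketch below groups titles and abstracts by language. The sample record is hypothetical: the `record` wrapper element, the use of `xml:lang` attributes and the identifier value are assumptions for the example, not a description of the actual files, so check a real file before reusing it.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the real files may use other wrappers or attributes.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>example-0001</dc:identifier>
  <dc:title xml:lang="en">Asthma in children</dc:title>
  <dc:title xml:lang="es">Asma en niños</dc:title>
  <dc:description xml:lang="en">A short abstract.</dc:description>
  <dc:description xml:lang="es">Un resumen breve.</dc:description>
</record>"""

# Namespace-qualified names as ElementTree spells them.
DC = "{http://purl.org/dc/elements/1.1/}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def fields_by_language(xml_text):
    """Group dc:title and dc:description text by their xml:lang value."""
    root = ET.fromstring(xml_text)
    result = {}
    for name in ("title", "description"):
        for elem in root.iter(DC + name):
            lang = elem.get(XML_LANG, "und")  # "und" = language not stated
            result.setdefault(lang, {})[name] = (elem.text or "").strip()
    return result


record = fields_by_language(SAMPLE)
print(record["en"]["title"], "|", record["es"]["title"])
```

Pairing the `en` and `es` entries of each record in this way yields title and abstract pairs that can serve as parallel text.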

SciELO

SciELO (Scientific Electronic Library Online) gathers electronic full-text articles from scientific journals of Latin America, South Africa and Spain. It is currently present in 15 countries and is supported by the Sao Paulo Research Foundation (FAPESP) and the Brazilian National Council for Scientific and Technological Development (CNPq).

This corpus contains titles and abstracts from 161,710 records in English and Spanish. Users can find the metadata of each record written in Dublin Core format.

SciELO Sp-En Dublin Core at ZENODO

PubMed

PubMed is a free search engine used to access the MEDLINE database, which is maintained by the U.S. National Library of Medicine (NLM).

This corpus contains titles and abstracts from 127,619 records. The metadata of each record is written in Dublin Core format. The original XML file that PubMed supplies for each record is included as well.

Pubmed Sp-En Dublin Core at ZENODO

Users can access all Spanish articles in PubMed by clicking here. To download the metadata of all articles in XML format, follow these steps:

Click on Send to.
Under Choose destination, select File.
Under Format, select XML.
Finally, click on Create File.
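
Once the XML file has been downloaded with the steps above, the titles can be pulled out along these lines. This is a minimal sketch, assuming the export follows the usual PubMed layout (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `VernacularTitle`); the sample record, its PMID and its titles are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the downloaded XML file.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>[Example title in English].</ArticleTitle>
        <VernacularTitle>Título de ejemplo en español.</VernacularTitle>
        <Language>spa</Language>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def text_of(parent, path):
    """Return the text of the first match, including text inside inline tags."""
    elem = parent.find(path)
    return "".join(elem.itertext()).strip() if elem is not None else ""


def extract_titles(root):
    """Collect (PMID, English title, original-language title) per article."""
    rows = []
    for article in root.iter("PubmedArticle"):
        rows.append((
            text_of(article, "MedlineCitation/PMID"),
            text_of(article, "MedlineCitation/Article/ArticleTitle"),
            text_of(article, "MedlineCitation/Article/VernacularTitle"),
        ))
    return rows


# For a downloaded file, use ET.parse("your_file.xml").getroot() instead.
rows = extract_titles(ET.fromstring(SAMPLE))
for pmid, english, vernacular in rows:
    print(pmid, english, vernacular, sep="\t")
```

Records without a `VernacularTitle` element simply yield an empty string in the third position, so they can be filtered out afterwards.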

Patient information: MedlinePlus

MedlinePlus is an online information service provided by the U.S. National Library of Medicine (NLM) that offers free health information in both English and Spanish.

There are 2 corpora available for download:

Health topics metadata in Dublin Core format: the source code of the site stores metadata about each topic, and we created the DC files from these metadata. This collection contains a total of 1,063 articles in English and Spanish.
MedlinePlus Sp-En Health Topics Dublin Core at ZENODO

Complete MedlinePlus in TEI format: clean raw text and XML files of each article, structured by sections and paragraphs. This collection contains a total of 7,033 articles in English and Spanish.
MedlinePlus TEI Sp-En files at ZENODO
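
To give an idea of how articles structured by sections and paragraphs could be read from TEI, the sketch below lists each section heading with its paragraphs. The sample document is hypothetical, and the use of `div`, `head` and `p` elements in the standard TEI namespace is an assumption about the files, not a description of them.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; whether the files declare it is an assumption.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical article with two sections.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Summary</head><p>First paragraph.</p><p>Second paragraph.</p></div>
    <div><head>Treatment</head><p>Third paragraph.</p></div>
  </body></text>
</TEI>"""


def sections(xml_text):
    """Return a list of (heading, [paragraph, ...]) tuples, one per div."""
    root = ET.fromstring(xml_text)
    out = []
    for div in root.iter(TEI + "div"):
        head = div.find(TEI + "head")
        title = "".join(head.itertext()).strip() if head is not None else ""
        # Collapse internal whitespace so wrapped lines become one string.
        paras = [" ".join("".join(p.itertext()).split())
                 for p in div.findall(TEI + "p")]
        out.append((title, paras))
    return out


parsed = sections(SAMPLE)
for title, paras in parsed:
    print(title, len(paras))
```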

Glossaries

This collection comprises 46 bilingual glossaries for various language pairs, compiled from free online medical glossaries and dictionaries made by over 500 professional translators. The glossaries are encoded in standard tab-separated values (tsv) format. 26 glossaries include English terms, 8 glossaries include Spanish terms and 13 files include other languages.

Check our GitHub repository
Find all files at our Zenodo repository
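
A tab-separated glossary can be loaded with the Python standard library alone. The sketch below assumes two columns per line (source term, then target term) and no header row; both are assumptions for the example, and the sample terms are invented, so adjust the column handling to the actual files.

```python
import csv
import io

# Hypothetical glossary content: source term <TAB> target term.
SAMPLE = (
    "heart attack\tinfarto de miocardio\n"
    "heart attack\tataque al corazón\n"
    "blood pressure\tpresión arterial\n"
)


def load_glossary(handle):
    """Map each source term to the list of its translations."""
    # QUOTE_NONE keeps quotation marks inside terms as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    glossary = {}
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, target = row[0].strip(), row[1].strip()
        glossary.setdefault(source, []).append(target)
    return glossary


glossary = load_glossary(io.StringIO(SAMPLE))
print(glossary["heart attack"])
```

For a file on disk, the same function can be called with `open(path, encoding="utf-8", newline="")`, where `path` is the glossary file. Keeping a list of translations per term preserves entries that have more than one target term.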

External Links

UFAL Medical Corpus v1.0

UFAL Medical Corpus is a collection of parallel corpora assembled during the KConnect, Khresmoi and HimL projects, which aim at more reliable machine translation of medical texts.

UFAL Medical Corpus v1.0.

EMEA corpus

EMEA is a corpus of biomedical documents retrieved from the European Medicines Agency (EMEA). The corpus includes documents related to medicinal products and their translations into 22 official languages of the European Union.

EMEA.

COPPA corpus

The COPPA corpus seeks to help users and researchers to overcome the language barrier when searching patents published in different languages and to stimulate research in Machine Translation and in language tools for patent texts.

COPPA.

Shared Task: Biomedical Translation Task (WMT18)

This task aims to evaluate systems on the translation of documents from the biomedical domain.

Shared Task: Biomedical Translation Task
