MeSpEn: the resource for English-Spanish Medical Machine Translation and Terminologies:

Census of Parallel Corpora, Glossaries and Term Translations

How to cite:

Marta Villegas, Ander Intxaurrondo, Aitor Gonzalez-Agirre, Montserrat Marimon, Martin Krallinger. (2018). The MeSpEN Resource for English-Spanish Medical Machine Translation and Terminologies: Census of Parallel Corpora, Glossaries and Term Translations. In LREC MultilingualBIO: Multilingual Biomedical Text Processing. ELRA.

Link to article.

Biomedical and clinical literature

IBECS

IBECS (Spanish Bibliographical Index in Health Sciences) is a bibliographical database that collects scientific journals covering multiple fields in health sciences. It is maintained by the Spanish National Health Sciences Library (BNCS), at the Carlos III Health Institute.

This corpus contains titles and abstracts from 168,198 records in English and Spanish. The metadata of each record is available in Dublin Core format, together with the original XML file supplied by IBECS.
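As a rough illustration of how Dublin Core metadata of this kind could be read, the following Python sketch pulls language-tagged titles and abstracts out of one record. The record layout is an assumption made for the example (a `record` wrapper, `dc:title` and `dc:description` elements carrying `xml:lang` attributes); the actual MeSpEn files may be organised differently, so adapt the element names after inspecting a real file.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace and the fixed namespace of xml:lang.
DC = "http://purl.org/dc/elements/1.1/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical record, invented for illustration only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title xml:lang="en">Example title</dc:title>
  <dc:title xml:lang="es">Título de ejemplo</dc:title>
  <dc:description xml:lang="en">Example abstract.</dc:description>
  <dc:description xml:lang="es">Resumen de ejemplo.</dc:description>
</record>"""


def parallel_fields(xml_text):
    """Return {field: {language: text}} for dc:title and dc:description."""
    root = ET.fromstring(xml_text)
    out = {}
    for field in ("title", "description"):
        for el in root.iter(f"{{{DC}}}{field}"):
            # "und" (undetermined) is used when no xml:lang attribute is present.
            lang = el.get(XML_LANG, "und")
            out.setdefault(field, {})[lang] = (el.text or "").strip()
    return out


fields = parallel_fields(SAMPLE)
print(fields["title"]["en"], "|", fields["title"]["es"])
```

Keying each field by language makes it straightforward to keep only the records where both the English and the Spanish version are present.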

SciELO

SciELO (Scientific Electronic Library Online) gathers electronic publications of complete full-text articles from scientific journals of Latin America, South Africa and Spain. It is currently present in 15 countries and is supported by the Sao Paulo Research Foundation (FAPESP) and the Brazilian National Council for Scientific and Technological Development (CNPq).

This corpus contains titles and abstracts from 161,710 records in English and Spanish. Users can find the metadata of each record written in Dublin Core format.

PubMed

PubMed is a free search engine used to access the MEDLINE database, which is maintained by the U.S. National Library of Medicine (NLM).

This corpus contains titles and abstracts from 127,619 records. The metadata of each record is available in Dublin Core format, together with the original XML file supplied by PubMed.

Users can access all Spanish articles in PubMed by clicking here. Follow these steps to download the metadata of all articles in XML format:

  • Click Send to.
  • Under Choose destination, select File.
  • Under Format, select XML.
  • Finally, click Create File.

Patient information: MedlinePlus

MedlinePlus is an online information service provided by the U.S. National Library of Medicine (NLM) that offers free health information in both English and Spanish. MedlinePlus provides the following information:

There are 2 corpora available for download:


Glossaries

This collection contains 46 bilingual glossaries for various language pairs, compiled from free online medical glossaries and dictionaries created by over 500 professional translators. The glossaries are encoded in standard tab-separated values (TSV) format. 26 glossaries include English terms, 8 glossaries include Spanish terms and 13 files include other languages.
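Because the glossaries are plain TSV files, they can be loaded with Python's standard `csv` module. The sketch below assumes a two-column layout (source term, then target term) with no header row. This layout is an assumption, as are the sample entries and the function name `load_glossary`; check the column order of each file before relying on it.

```python
import csv
import io

# Hypothetical glossary content; the last line is deliberately malformed.
SAMPLE_TSV = "heart attack\tinfarto de miocardio\nkidney\triñón\nmalformed line\n"


def load_glossary(handle):
    """Map each lower-cased source term to a list of its target terms."""
    glossary = {}
    # QUOTE_NONE keeps quotation marks inside terms as literal characters.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip rows that lack a source or a target term.
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue
        glossary.setdefault(row[0].strip().lower(), []).append(row[1].strip())
    return glossary


glossary = load_glossary(io.StringIO(SAMPLE_TSV))
print(glossary["kidney"])
```

Storing a list per source term keeps alternative translations instead of silently overwriting them. To read a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the handle to `load_glossary`.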


External Links

UFAL Medical Corpus v1.0

UFAL Medical Corpus is a collection of parallel corpora assembled during the course of projects KConnect, Khresmoi and HimL aiming at more reliable machine translation of medical texts.

EMEA corpus

EMEA is a corpus of biomedical documents retrieved from the European Medicines Agency (EMEA). The corpus includes documents related to medicinal products and their translations into 22 official languages of the European Union.

COPPA corpus

The COPPA corpus seeks to help users and researchers to overcome the language barrier when searching patents published in different languages and to stimulate research in Machine Translation and in language tools for patent texts.

Shared Task: Biomedical Translation Task (WMT18)

This task aims to evaluate systems on the translation of documents from the biomedical domain.