ClinSpEn-Clinical Terms

The ClinSpEn-CT (clinical terms) dataset is a collection of EN-ES parallel biomedical terms.

  • Track direction: ES > EN
  • Link to data: Zenodo.

Overview

Translating clinical terminology is particularly important because the clinical domain contains many established concepts and multi-word expressions (MWEs) that must be translated not only correctly but also consistently. Systems that handle specific terms as well as full sentences can provide more accurate translations, which is fundamental in the clinical domain.

Corpus Description

ClinSpEn-CT includes a total of 19 128 terms. The terms were extracted from biomedical literature and electronic health records, then translated and revised by professional medical translators. The selected terms include, among others, diseases, symptoms and findings, procedures, drugs and species.

The terms are provided as tab-separated values (TSV) files. For more information on the TSV file specifics, please check Zenodo.
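As a minimal sketch of how such a file could be read, the Python snippet below parses tab-separated rows with the standard csv module. The two-column layout (Spanish term, English term) and the sample rows are assumptions made purely for illustration, and read_tsv_rows is a hypothetical helper name; the actual columns and headers are documented on Zenodo and should be checked before use.

```python
import csv
import io


def read_tsv_rows(handle):
    """Yield each non-empty row of a tab-separated file as a list of fields."""
    # QUOTE_NONE keeps quote characters inside terms as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


# Hypothetical sample, NOT taken from the dataset: the column order
# (Spanish term, then English term) is an assumption for illustration.
sample = "dolor de cabeza\theadache\nfiebre\tfever\n"

rows = list(read_tsv_rows(io.StringIO(sample)))
for spanish, english in rows:
    print(f"{spanish} -> {english}")
```

For a real file, the same function can be given a handle opened with open(path, encoding="utf-8", newline=""), where path is whatever local name the downloaded file has.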

Corpus Partitions

ClinSpEn-CT is divided as follows:

  • ClinSpEn-CT Sample/Dev Set (7 000 terms)
  • ClinSpEn-CT Test Set (12 128 terms)

In addition, we include a larger collection of 201 890 monolingual (Spanish) terms (background set) that can be used to evaluate the systems’ performance on new, unseen data.

Below is an example of the CT dataset taken from the sample set: