ClinSpEn-Ontology Concepts

The ClinSpEn-OC (Ontology Concepts) dataset is a collection of English-Spanish (EN-ES) biomedical controlled-vocabulary concepts.

  • Track direction: EN > ES
  • Link to data: Zenodo.

Overview

Ontologies are one of the main ways of structuring knowledge. However, many of them are available only in English, which greatly limits their everyday use in other languages. Machine translation systems trained specifically on this type of data can broaden the reach of these ontologies or ease a manual translation process.

Corpus Description

The concepts for this task were extracted from various free-access biomedical ontologies and taxonomies and then manually translated by a professional medical translator. Because of their origin, these concepts may pose challenges that differ from those of terms extracted from free text; for example, some concepts are semi-structured.

The ClinSpEn-OC Gold Standard includes a total of 2189 concepts. The concepts are distributed as tab-separated values (TSV) files. For details on the TSV file layout, please check Zenodo.
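
As a rough illustration of working with tab-separated files like these, the following minimal Python sketch reads a TSV file into a list of rows. The function name read_tsv_rows is hypothetical, and the sketch deliberately makes no assumption about column names, column order, or the presence of a header row, since those details are documented on Zenodo rather than here.

```python
import csv
from pathlib import Path


def read_tsv_rows(path):
    """Return every non-empty row of a tab-separated file as a list of strings.

    Assumption: the file is UTF-8 encoded. Column meaning and any header row
    are not interpreted here; check the Zenodo documentation for the layout.
    """
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside concept names verbatim
        # instead of treating them as CSV quoting characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

Calling len() on the returned list gives a row count that can be compared with the documented partition sizes, keeping in mind that a header row, if the files have one, would add one to the count.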

Corpus Partitions

ClinSpEn-OC is divided as follows:

  • ClinSpEn-OC Sample Set (400 concepts)
  • ClinSpEn-OC Test Set (1789 concepts)

In addition, we include a larger background set of 299,408 monolingual (English) concepts that can be used to evaluate the systems’ performance on new, unseen data.

Below is an example of the Ontology Concepts data taken from the sample set: