ClinSpEn-Clinical Cases

The ClinSpEn-CC (clinical cases) dataset is a collection of EN-ES parallel COVID-19 clinical cases.

Track direction: EN > ES
Link to data: Zenodo.

Overview

Clinical cases are a text genre where a patient’s current condition, medical history, clinical presentation, examinations, treatment and diagnosis are described. They can be pretty similar to Electronic Health Records (EHRs) both in form and content. However, unlike EHRs, clinical cases are often free of privacy-related issues. This means that they can be used as substitute to train NLP systems for the clinical domain.

Corpus Description

The dataset’s clinical cases were carefully selected to cover a wide range of aspects related to COVID-19: different types of patients (children, adults, elderly and pregnant people, babies), different comorbidities (cancer, mental health issues, immunosuppressed patients) and symptomatology (mild and severe presentations, dermatologic, immunologic and psychiatric manifestations, thrombosis, …). The reports were translated from English to Spanish by a professional medical translator on a first step and revised by a clinical expert on a second step.

ClinSpEn-CC includes a total of 202 parallel clinical cases. Each file is duplicated, with the Spanish version having a “.es” extension and the English files having a “.en” extension. Each report has been parallelized so that every sentence’s line number corresponds to the same sentence’s line number in both languages.

Corpus Partitions

ClinSpEn-CC is divided as follows:

ClinSpEn-CC Sample Set (50 documents)
ClinSpEn-CC Test Set (152 documents)

In addition, we include a larger collection of 9,804 monolingual (English) clinical cases of different topics and specialties (background set) that can be used to evaluate the systems’ performance in new, unseen data.

Below is an example of a parallel case report taken from the sample set: