The MEDDOPROF corpus has been randomly split into two subsets: a training set and a test set.

The complete dataset is available in Zenodo.

Sample set

The sample set is composed of 15 clinical cases extracted from the training set. To make the sample set somewhat representative of the corpus, we included cases from four specialties: radiology, oncology, psychiatry and occupational health.

Download the sample set from Zenodo.

Training set

The training set is composed of 1500 clinical cases (~80% of the corpus).

Download the training set from Zenodo.

Codes Reference List

For task 3 (MEDDOPROF-NORM), a reference list with all valid codes is provided. It is a .tsv file with three columns: code, label and alternative label. The list contains codes from two sources, ESCO and SNOMED-CT; SNOMED-CT codes are preceded by the string ‘SCTID:’. With a few exceptions, professions are mapped to ESCO, while working statuses and activities are mapped to SNOMED-CT.

Download the codes reference list from Zenodo.
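
As a rough illustration of how the reference list could be read, the sketch below parses a three-column tab-separated file and groups the codes by source using the ‘SCTID:’ prefix described above. Several details are assumptions rather than facts about the released file: whether it has a header row (controlled here by the hypothetical has_header argument), that every code without the prefix belongs to ESCO, and the function name load_codes itself. The two rows in the demo are placeholders, not real entries from the list.

```python
import csv
import io

SNOMED_PREFIX = "SCTID:"  # prefix named in the task description


def load_codes(handle, has_header=False):
    """Group (label, alternative label) pairs by source terminology.

    Assumption: any code that does not start with SNOMED_PREFIX is ESCO.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:  # assumption: the real file may or may not have a header
        next(reader, None)
    codes = {"ESCO": {}, "SNOMED-CT": {}}
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Pad short rows so a missing label or alternative label becomes "".
        code, label, alt_label = (row + ["", ""])[:3]
        source = "SNOMED-CT" if code.startswith(SNOMED_PREFIX) else "ESCO"
        codes[source][code] = (label, alt_label)
    return codes


# Placeholder rows for demonstration only; they are NOT real codes.
sample = (
    "SCTID:000000\tplaceholder status\t\n"
    "0000.0\tplaceholder profession\tplaceholder alternative\n"
)
codes = load_codes(io.StringIO(sample))
```

With the downloaded file, the same function could be called on an open file handle, for example `open(path, encoding="utf-8", newline="")`, where path is wherever the list was saved. A set built from the dictionary keys would then give a quick way to check that predicted codes are valid.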

Test set

The test set is composed of 344 clinical cases (~20% of the corpus).

Download the test set from Zenodo.