Datasets can be already downloaded from Zenodo.


The CodiEsp corpus has been randomly sampled into three subsets: the train, the development, and the test set. The training set contains 500 clinical cases, and the development and test set 250 clinical cases each.

CodiEsp corpus will be published translated to English during the week previous to the test set release (10-14 February).

Train set

The train set is composed of 500 clinical cases. Download it from Zenodo.

Development set

The Development set is composed of 250 clinical cases. Download it from Zenodo.

Test and Background set

The test set has 250 clinical cases. The background set is composed of 2,751 clinical cases. Download them from Zenodo.

Test set with Gold Standard annotations

The Test set is with Gold Standard annotations is composed of 250 clinical cases. Download it from Zenodo.


Additional Datasets

Additional set 1 – Abstracts with ICD10 codes.

To expand the train and development corpora, a JSON file with abstracts from Lilacs and Ibecs with ICD10 codes (ICD10-CM and ICD10-PCS) associated with them (CIE10 in Spanish) is provided. Download it from ZENODO.

The format of the JSON file is the following:

{'articles':
	[{'title': 'title',
	'pmid': 'pmid',
	'abstractText': 'abtract (in Spanish)',
	'Mesh':
		[{'Code': 'MeSHCode',
		'Word': 'reference',
		'CIE': [CIE10_1, CIE10_2, ...]},
		...]
	},
	...]
}

There are 176 294 abstracts with an average of 2.5 ICD10 codes per abstract.

In addition, the same additional dataset is provided in the CodiEsp format: abstracts are distributed in individual UTF-8 text files and a tab-separated file summarizes the JSON information in four columns:

articleID label ICD10-code word