Datasets – MEDDOCAN

The MEDDOCAN corpus has been randomly split into three subsets: the training, development, and test sets. The training set contains 500 clinical cases, and the development and test sets contain 250 clinical cases each.
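As an illustration only, a random 500/250/250 split like the one described above could be sketched as follows. This is not the organizers' actual procedure: the function name `split_corpus`, the seed, and the document IDs are all hypothetical.

```python
import random

def split_corpus(doc_ids, n_train=500, n_dev=250, n_test=250, seed=0):
    """Shuffle document IDs and cut them into train/dev/test lists."""
    if len(doc_ids) < n_train + n_dev + n_test:
        raise ValueError("not enough documents for the requested split sizes")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(doc_ids)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test

# Hypothetical IDs standing in for the 1,000 annotated clinical cases.
ids = [f"case_{i:04d}" for i in range(1000)]
train, dev, test = split_corpus(ids)
print(len(train), len(dev), len(test))
```

Because the three slices are taken from one shuffled list, no clinical case can end up in more than one subset.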

Sample set

The sample set is composed of 15 clinical cases extracted from the training set. This sample set is also included in the evaluation script (see Resources). Download the sample set from here.

Train set

The train set is composed of 500 clinical cases. It is distributed in Brat and XML formats (the latter is based on the i2b2 XML format). Download the train set from here.
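For readers unfamiliar with the Brat format mentioned above, the following is a minimal sketch of reading the text-bound annotations (the `T` lines) from a Brat standoff `.ann` file. It assumes the standard Brat layout of `ID<TAB>LABEL START END<TAB>TEXT`. The sample text and the labels `NAME` and `DATE` are invented placeholders, not the actual MEDDOCAN tag set.

```python
from typing import List, NamedTuple, Tuple

class Entity(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # character offsets into the .txt file
    text: str

def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound (T) annotations from Brat standoff content."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, blank lines
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line without a surface-text field
        ann_id, meta, surface = parts[0], parts[1], parts[2]
        label, offsets = meta.split(" ", 1)
        spans = []
        # Discontinuous annotations separate their fragments with ';'.
        for chunk in offsets.split(";"):
            start, end = chunk.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, surface))
    return entities

# Invented example; labels are placeholders, not the MEDDOCAN tag set.
sample_txt = "Nombre: Juan. Fecha: 12/03/2018."
sample_ann = "T1\tNAME 8 12\tJuan\nT2\tDATE 21 31\t12/03/2018\n"

entities = parse_brat_entities(sample_ann)
for ent in entities:
    start, end = ent.spans[0]
    print(ent.label, repr(sample_txt[start:end]))
```

Checking that each offset pair slices the expected string out of the matching `.txt` file is a quick way to confirm that annotation and text files are aligned.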

Development set

The development set is composed of 250 clinical cases. It is distributed in Brat and XML formats (the latter is based on the i2b2 XML format). Download the development set from here.

Background set

The background set is composed of 2,751 clinical cases. It is distributed in plain text format. Download the background set from here.

Test set with Gold Standard annotations

The test set with Gold Standard annotations is composed of 250 clinical cases. It is distributed in Brat and XML formats (the latter is based on the i2b2 XML format). Download the test set with Gold Standard annotations from here.
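The XML files can be read with a standard XML parser. The sketch below assumes a layout modeled on the i2b2 2014 de-identification files: a `TEXT` element holding the document text and a `TAGS` element whose children carry `start`, `end`, `text`, and `TYPE` attributes. The root element name `DOC`, the tag names, and the sample content are placeholders, so the layout should be checked against the downloaded files before relying on it.

```python
import xml.etree.ElementTree as ET

# Invented document; element and attribute names are assumptions
# modeled on the i2b2 2014 de-identification XML layout.
sample_xml = """<DOC>
  <TEXT><![CDATA[Nombre: Juan. Fecha: 12/03/2018.]]></TEXT>
  <TAGS>
    <NAME id="T1" start="8" end="12" text="Juan" TYPE="PATIENT"/>
    <DATE id="T2" start="21" end="31" text="12/03/2018" TYPE="DATE"/>
  </TAGS>
</DOC>"""

def parse_i2b2_style(xml_text):
    """Return the document text and a list of annotation dicts."""
    root = ET.fromstring(xml_text)
    text = root.findtext("TEXT", default="")
    annotations = []
    tags = root.find("TAGS")
    if tags is not None:
        for el in tags:
            annotations.append({
                "tag": el.tag,
                "type": el.get("TYPE"),
                "start": int(el.get("start")),
                "end": int(el.get("end")),
                "text": el.get("text"),
            })
    return text, annotations

doc_text, annotations = parse_i2b2_style(sample_xml)
for ann in annotations:
    # The offsets should slice the annotated string out of the text.
    print(ann["tag"], ann["type"], repr(doc_text[ann["start"]:ann["end"]]))
```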