Data – MEDDOCAN

Description of the Corpus

For this task, we have prepared a synthetic corpus of clinical cases enriched with protected health information (PHI) expressions, named the MEDDOCAN corpus. The 1,000 clinical case studies in the corpus were selected manually by a practicing physician and augmented with PHI phrases by health documentalists, who added PHI information from discharge summaries and medical genetics clinical records.

The final collection of 1,000 clinical cases that makes up the corpus contains around 33 thousand sentences, an average of around 33 sentences per clinical case. The MEDDOCAN corpus contains around 495 thousand words, with an average of 494 words per clinical case, slightly fewer than the records of the i2b2 de-identification longitudinal corpus (617 tokens per record). The MEDDOCAN corpus will be distributed as plain text in UTF-8 encoding, with each clinical case stored as a single file, while PHI annotations will be released in the popular BRAT format, which makes visualization of results straightforward, as shown in Figure 1.

Figure 1: An example of MEDDOCAN annotation visualized using the BRAT annotation interface.
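Since the annotations are released in BRAT format, a minimal Python sketch of reading entity mentions from a BRAT standoff (.ann) file may be helpful. It assumes the standard BRAT text-bound annotation layout (ID, tab, label with character offsets, tab, covered text). The function name, the `Entity` class, the sample text, and the labels `PERSON_NAME` and `DATE` are illustrative placeholders, not the official MEDDOCAN entity types or tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    ann_id: str   # BRAT identifier, e.g. "T1"
    label: str    # entity type (placeholder labels in this sketch)
    start: int    # character offset of the first character
    end: int      # character offset one past the last character
    text: str     # the annotated string


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from the contents of a .ann file."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations, attributes, and blank lines
        ann_id, label_and_span, text = line.split("\t", 2)
        label, span = label_and_span.split(" ", 1)
        # Discontinuous spans are separated by ";". This sketch keeps only
        # the first start offset and the last end offset.
        fragments = span.split(";")
        start = int(fragments[0].split()[0])
        end = int(fragments[-1].split()[1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Hypothetical document and annotation file contents.
document = "Nombre: Juan. Ingreso el 12/03/2018."
annotations = "T1\tPERSON_NAME 8 12\tJuan\nT2\tDATE 25 35\t12/03/2018\n"

entities = parse_brat_entities(annotations)
for entity in entities:
    # The offsets index directly into the plain-text file of the same case.
    print(entity.label, document[entity.start:entity.end])
```

Because each clinical case is stored as a single plain-text file, the offsets of every annotation can be checked against the text of the matching file in this way.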

The entire MEDDOCAN corpus has been randomly sampled into three subsets: the training, development, and test sets. The training set comprises 500 clinical cases, and the development and test sets comprise 250 clinical cases each. Together with the test set release, we will release an additional collection of 2,000 documents (background set), both to ensure that participating teams cannot make manual corrections and to encourage systems that can scale to larger data collections.
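As an illustration of this kind of partition, the following Python sketch randomly splits 1,000 document identifiers into subsets of 500, 250, and 250. It is not the procedure actually used to build the corpus: the function name, the seed, and the identifier pattern are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(
    doc_ids: Sequence[str],
    n_train: int = 500,
    n_dev: int = 250,
    n_test: int = 250,
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Randomly partition document IDs into train, development, and test sets."""
    if len(doc_ids) != n_train + n_dev + n_test:
        raise ValueError("subset sizes must add up to the number of documents")
    shuffled = sorted(doc_ids)             # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, independent of global state
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Hypothetical identifiers; the real file names are not specified here.
doc_ids = [f"case_{i:04d}" for i in range(1000)]
train, dev, test = split_corpus(doc_ids)
print(len(train), len(dev), len(test))
```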

For this task we also prepared a conversion script (see Resources) between the BRAT annotation format and the annotation format used by the previous i2b2 effort, to make comparison and adaptation of previous systems used for English texts easier.

Datasets

The MEDDOCAN corpus has been randomly sampled into three subsets: the train, development, and test sets. The train set contains 500 clinical cases, and the development and test sets contain 250 clinical cases each.

Sample set

The sample set is composed of 15 clinical cases extracted from the training set. This sample set is also included in the evaluation script (see Resources). Download the sample set from here.

Train set

The train set is composed of 500 clinical cases. It is distributed in BRAT and XML formats (the latter is based on the i2b2 XML format). Download the train set from here.

Development set

The development set is composed of 250 clinical cases. It is distributed in BRAT and XML formats (the latter is based on the i2b2 XML format). Download the development set from here.

Background set

The background set is composed of 2,751 clinical cases. It is distributed in plain text format. Download the background set from here.

Test set with Gold Standard annotations

The test set with Gold Standard annotations is composed of 250 clinical cases. It is distributed in BRAT and XML formats (the latter is based on the i2b2 XML format). Download the test set with Gold Standard annotations from here.

Annotation Guidelines

The official annotation guidelines used to annotate the MEDDOCAN data sets can be downloaded from here.

The MEDDOCAN annotation scheme defines a total of 29 granular entity types grouped into more general parent classes. Figure 1 summarizes the list of sensitive entity types defined for the MEDDOCAN track.

Figure 1: Overview of the sensitive entity (PHI) types defined for the MEDDOCAN annotation scheme and track.

The annotation process of the MEDDOCAN corpus was initially inspired by the annotation schemes and corpora used for the i2b2 de-identification tracks. We revised the guidelines used for these tracks, translated certain characteristics into Spanish, and adapted them to the specificities and needs of our document collection and legislative framework. This adaptation was carried out in collaboration with practicing physicians, a team of annotators, and the University Hospital 12 de Octubre. The adaptation, translation, and refinement of the guidelines were carried out on several random sample sets of the MEDDOCAN corpus and tied to an iterative process of annotation consistency analysis through inter-annotator agreement (IAA) calculation, which continued until a high annotation quality in terms of IAA was reached. Three cycles of refinement and IAA analysis were needed to reach the quality criteria required for this track, in line with similar scores obtained, for instance, for i2b2. The final version of the 28-page annotation guidelines can be downloaded from here.

This iterative refinement process required direct interaction between the expert annotators in order to resolve discrepancies. Using a side-by-side visualization of the clinical case annotations, they went over each discrepancy in order to resolve doubts, add or refine rules, and add or edit clarifying examples in the guidelines. The final inter-annotator agreement for this corpus was calculated on a set of 50 records that were double annotated (blinded) by two different expert annotators, reaching a pairwise agreement of 98% at the level of exact entity mention comparison together with the corresponding mention type labels.
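The text does not state the exact formula behind the pairwise agreement figure. As one plausible reading, the Python sketch below treats each annotation as a (document, start, end, label) tuple and computes the F1-style overlap between the two annotators' sets, so that a mention counts as agreed only if both the exact span and the type label match. The function name, the formula choice, and the sample data are assumptions for illustration.

```python
from typing import Iterable, Tuple

# (document ID, start offset, end offset, entity label)
Mention = Tuple[str, int, int, str]


def pairwise_agreement(annotator_a: Iterable[Mention],
                       annotator_b: Iterable[Mention]) -> float:
    """Exact-match agreement: 2 * |A and B| / (|A| + |B|)."""
    set_a, set_b = set(annotator_a), set(annotator_b)
    if not set_a and not set_b:
        return 1.0  # nothing annotated by either annotator: trivially in agreement
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical annotations with placeholder labels.
annotator_a = [("doc1", 0, 4, "NAME"), ("doc1", 10, 20, "DATE"), ("doc2", 5, 9, "NAME")]
annotator_b = [("doc1", 0, 4, "NAME"), ("doc1", 10, 20, "DATE"), ("doc2", 5, 9, "LOCATION")]

agreement = pairwise_agreement(annotator_a, annotator_b)
print(f"{agreement:.3f}")
```

In this toy example the third mention has the same span but a different label, so it counts as a disagreement under the exact-match criterion.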

The manual annotation of the entire corpus was carried out in three steps. First, an initial annotation pass was carried out using an adapted version of the AnnotateIt tool. The resulting annotations were then exported, trailing whitespace was removed, and double annotations of the same string were flagged to the human annotators for revision or correction. Second, the annotations were uploaded into the BRAT annotation tool, which was slightly less efficient for mention labeling in terms of manual annotation speed, and the human annotators performed a final revision of the annotations in BRAT to correct mistakes and add potentially missing mentions. Finally, the senior annotator carried out a last round of annotation revision over the entire corpus. Figure 2 shows an example of an annotation in the BRAT format.
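The clean-up applied to the exported annotations can be sketched in Python as follows. This is not the project's actual script: the function names are hypothetical, and "double annotations of the same string" is interpreted here as the same character span carrying more than one annotation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]


def trim_trailing_whitespace(text: str, start: int, end: int) -> Span:
    """Move the end offset left until the span no longer ends in whitespace."""
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


def find_double_annotations(
    annotations: Iterable[Tuple[int, int, str]],
) -> Dict[Span, List[str]]:
    """Return the spans annotated more than once, with all of their labels."""
    labels_by_span: Dict[Span, List[str]] = defaultdict(list)
    for start, end, label in annotations:
        labels_by_span[(start, end)].append(label)
    return {span: labels for span, labels in labels_by_span.items() if len(labels) > 1}


# Hypothetical text and annotations with placeholder labels.
text = "Paciente Juan  ingresa"
trimmed = trim_trailing_whitespace(text, 9, 15)   # span originally covered "Juan  "
alerts = find_double_annotations([(9, 13, "LABEL_A"), (9, 13, "LABEL_B"), (0, 8, "LABEL_C")])
print(trimmed, alerts)
```

Spans returned by `find_double_annotations` correspond to the cases that would be sent back to a human annotator rather than resolved automatically.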