Evaluation

For the MEDDOCAN track, we will follow essentially the same evaluation setting as that used for previous de-identification tracks at i2b2. We will set up an external scientific advisory board of international experts on this topic to provide feedback and experience from de-identification efforts carried out in the US and UK.

We are also aware that there is considerable implicit variability between document types and between hospitals, which in practice affects the difficulty of this kind of track. However, as this is the first time such a task is being carried out for Spanish, we do not want to add a further level of complexity. Previous de-identification efforts made clear that different uses might require a different balance between precision and recall. For instance, for internal use within hospital settings (limited data release), high precision is more desirable because there is a reduced risk of exposure of these documents, whereas for HIPAA-compliant release in the US (unlimited data release), high recall is critical to avoid leakage of sensitive data.

Evaluation of automatic predictions for this task will have two different scenarios or sub-tracks: the NER offset and entity type classification sub-track and the sensitive span detection sub-track.

NER offset and entity type classification: The first evaluation scenario will consist of the classical entity-based or instance-based evaluation, which requires that system outputs match exactly the beginning and end locations of each PHI entity tag and also detect the annotation type correctly.
Sensitive span detection: The second evaluation scenario or sub-track is more specific to the practical scenario needed for releasing de-identified clinical documents, where the ultimate goal is to identify and be able to obfuscate or mask sensitive data, regardless of the actual type of entity or the correct offset identification of multi-token sensitive phrase mentions. This second sub-track will use a span-based evaluation that only checks whether spans belonging to sensitive phrases are detected correctly. This boils down to a classification of spans, where systems try to obfuscate spans that contain sensitive PHI expressions.
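
The difference between the two scenarios can be illustrated with a minimal Python sketch. It is not the official evaluation script: the Annotation type, both function names, and the labels PATIENT_NAME, DOCTOR_NAME and DATE are hypothetical, and the span-based variant is assumed here to compare character offsets while ignoring the entity type.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """Hypothetical container for one PHI annotation (character offsets)."""
    start: int
    end: int
    label: str


def ner_matches(gold: Iterable[Annotation], pred: Iterable[Annotation]) -> Set[Annotation]:
    """Entity-based matching: start, end and type must all agree."""
    return set(gold) & set(pred)


def span_matches(gold: Iterable[Annotation], pred: Iterable[Annotation]) -> Set[Tuple[int, int]]:
    """Span-based matching (assumed variant): offsets must agree, type is ignored."""
    gold_spans = {(a.start, a.end) for a in gold}
    pred_spans = {(a.start, a.end) for a in pred}
    return gold_spans & pred_spans


# Illustrative data: the system finds both spans but mislabels the first one.
gold = [Annotation(0, 4, "PATIENT_NAME"), Annotation(10, 18, "DATE")]
pred = [Annotation(0, 4, "DOCTOR_NAME"), Annotation(10, 18, "DATE")]

ner_hits = ner_matches(gold, pred)    # only the DATE annotation matches
span_hits = span_matches(gold, pred)  # both spans match
```

In this example the mislabelled name counts as an error in the first scenario but is still credited in the second, since the sensitive text would be masked either way.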

As part of the evaluation process, we plan to carry out statistical significance testing between system runs using approximate randomization, following settings previously used in the context of i2b2 challenges. The evaluation scripts, together with proper documentation and README files with instructions, will be freely available on GitHub to enable participating teams to test them locally.
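
As a rough illustration of approximate randomization, the following Python sketch compares two system runs on micro-averaged F1. It does not reproduce the i2b2 settings, which are not detailed here: the function names, the per-document (tp, fp, fn) input format, the document-level swapping, the number of trials and the fixed seed are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives) for one document


def micro_f1(counts: Sequence[Counts]) -> float:
    """Micro-averaged F1 from per-document counts; equals 2*P*R/(P+R)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def approximate_randomization(counts_a: Sequence[Counts],
                              counts_b: Sequence[Counts],
                              trials: int = 9999,
                              seed: int = 0) -> float:
    """Estimate a two-sided p-value for the F1 difference between two runs.

    In each trial, the outputs of the two systems are swapped per document
    with probability 0.5, and the resulting F1 difference is compared with
    the observed one.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both runs must cover the same documents")
    rng = random.Random(seed)
    observed = abs(micro_f1(counts_a) - micro_f1(counts_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a: List[Counts] = []
        shuffled_b: List[Counts] = []
        for a, b in zip(counts_a, counts_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(micro_f1(shuffled_a) - micro_f1(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Illustrative per-document counts for two hypothetical runs.
run_a = [(9, 1, 1), (7, 2, 1), (8, 0, 2), (6, 1, 1)]
run_b = [(8, 2, 2), (7, 1, 1), (6, 3, 4), (6, 2, 1)]
p_value = approximate_randomization(run_a, run_b, trials=999)
```

A small p-value suggests that the observed F1 difference is unlikely to arise from randomly exchanging the two systems' outputs.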

For both sub-tracks the primary de-identification metrics used will consist of standard measures from the NLP community, namely micro-averaged precision, recall, and balanced F-score:

Precision (P) = true positives/(true positives + false positives)

Recall (R) = true positives/(true positives + false negatives)

F-score (F1) = 2*((P*R)/(P+R))
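
As a worked example of these formulas, the Python sketch below computes the micro-averaged values by pooling counts over all documents before dividing. The function name, the input format (annotation sets keyed by document ID) and the toy annotations are assumptions for illustration only, not the official evaluation script.

```python
from typing import Dict, Hashable, Set, Tuple


def micro_prf(gold_by_doc: Dict[str, Set[Hashable]],
              pred_by_doc: Dict[str, Set[Hashable]]) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, F1) over all documents."""
    tp = fp = fn = 0
    for doc_id in gold_by_doc.keys() | pred_by_doc.keys():
        gold = gold_by_doc.get(doc_id, set())
        pred = pred_by_doc.get(doc_id, set())
        tp += len(gold & pred)   # predicted and correct
        fp += len(pred - gold)   # predicted but not in the gold standard
        fn += len(gold - pred)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy annotations as (start, end, type) tuples; the types are placeholders.
gold = {
    "doc1": {(0, 4, "A"), (10, 18, "B"), (20, 25, "C")},
    "doc2": {(5, 9, "A")},
}
pred = {
    "doc1": {(0, 4, "A"), (10, 18, "X")},
    "doc2": {(5, 9, "A"), (30, 35, "B")},
}
precision, recall, f1 = micro_prf(gold, pred)  # 2 TP, 2 FP, 2 FN overall
```

With 2 true positives, 2 false positives and 2 false negatives pooled across both documents, precision, recall and F1 all equal 0.5.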

For both sub-tracks, the official evaluation and the ranking of the submitted systems will be based exclusively on the F-score (F1) measure (labeled as “SubTrack 1 [NER]” and “SubTrack 2 [strict]” in the evaluation script). The other metrics explained below are given only to provide more detailed information about the performance of the systems.

Moreover, there will also be sub-track-specific evaluation metrics. For the first sub-track, the leak score previously proposed for the i2b2 challenges will be computed; it relates to the detection of leaks (non-redacted PHI remaining after de-identification) and is defined as (# false negatives / # sentences present). For the second sub-track, we will additionally compute another evaluation in which spans of PHI connected by non-alphanumerical characters are merged. These metrics are not the official metrics of the task.
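
The two additional metrics can be sketched in Python as follows. This is an interpretation of the description above rather than the official script: the function names are hypothetical, and "connected by non-alphanumerical characters" is assumed to mean that the text between two consecutive spans contains no letter or digit.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def leak_score(false_negatives: int, sentence_count: int) -> float:
    """Leak score: number of missed PHI mentions per sentence."""
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    return false_negatives / sentence_count


def merge_spans(text: str, spans: Iterable[Span]) -> List[Span]:
    """Merge spans separated only by non-alphanumerical characters."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = text[prev_end:start]  # empty when spans touch or overlap
            if not any(ch.isalnum() for ch in gap):
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


# Illustrative sentence: the two name tokens are separated only by a space,
# so they are merged; "vive en" keeps the city as a separate span.
text = "Maria Lopez vive en Madrid"
spans = [(0, 5), (6, 11), (20, 26)]
merged = merge_spans(text, spans)  # [(0, 11), (20, 26)]
leak = leak_score(false_negatives=3, sentence_count=120)
```

Under this reading, a system that tags "Maria" and "Lopez" separately and one that tags "Maria Lopez" as a single span would be scored identically in the merged evaluation.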

In terms of submissions, we will allow up to 5 runs per registered team. Submissions have to be provided in a predefined prediction format (brat or i2b2) and returned to the track organizers before the test set submission deadline (end of May).

See evaluation examples here.