For the MEDDOCAN track, we will follow essentially the same evaluation setting as used in previous de-identification tracks.
Evaluation of automatic predictions for this task will have two different scenarios or sub-tracks: the NER offset and entity type classification sub-track and the sensitive span detection sub-track.
• NER offset and entity type classification: The first evaluation scenario will consist of the classical entity-based or instance-based evaluation, which requires that system outputs exactly match the beginning and end locations of each PHI entity tag, as well as correctly detecting the annotation type.
• Sensitive span detection: The second evaluation scenario or sub-track is more specific to the practical scenario needed for releasing de-identified clinical documents, where the ultimate goal is to identify and be able to obfuscate or mask sensitive data, regardless of the actual entity type or the correct offset identification of multi-token sensitive phrase mentions. This second sub-track will consider a span-based evaluation, assessing only whether spans belonging to sensitive phrases are detected correctly. This boils down to a classification of spans, where systems try to obfuscate spans that contain sensitive PHI expressions.
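The difference between the two sub-tracks can be illustrated with a minimal sketch (not the official scorer): annotations are (start, end, type) character-offset triples, and the toy data and function names below are purely illustrative.

```python
# Illustrative sketch of the two matching criteria, not the official scorer.
# Annotations are (start, end, type) character-offset triples.

def entity_match(gold, pred):
    """Sub-track 1: exact offsets AND entity type must both match."""
    return set(gold) & set(pred)

def span_match(gold, pred):
    """Sub-track 2: only the character span matters, not the type."""
    gold_spans = {(s, e) for s, e, _ in gold}
    pred_spans = {(s, e) for s, e, _ in pred}
    return gold_spans & pred_spans

gold = [(0, 11, "NOMBRE_SUJETO_ASISTENCIA"), (20, 30, "FECHAS")]
pred = [(0, 11, "NOMBRE_PERSONAL_SANITARIO"),  # right span, wrong type
        (20, 30, "FECHAS")]

print(len(entity_match(gold, pred)))  # 1 true positive under sub-track 1
print(len(span_match(gold, pred)))    # 2 true positives under sub-track 2
```

A prediction with the correct span but the wrong PHI type thus counts as an error in the first sub-track but as a correct detection in the second, since the sensitive text would still be obfuscated.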
As part of the evaluation process, we plan to carry out statistical significance testing between system runs using approximate randomization, following settings previously used in the context of the i2b2
challenges. The evaluation scripts used, together with proper documentation and
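Approximate randomization tests whether an observed score difference between two runs could have arisen by chance, by repeatedly swapping paired per-document scores. The following is a minimal sketch under assumed conventions (the function name, the add-one smoothing, and the two-sided statistic are illustrative choices, not the official MEDDOCAN procedure):

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided approximate randomization test on paired per-document
    scores of two system runs. Returns an estimated p-value for the
    observed mean difference. Hypothetical helper, not the official
    evaluation script."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        a, b = [], []
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap the paired scores
                x, y = y, x
            a.append(x)
            b.append(y)
        if abs(sum(a) - sum(b)) / n >= observed:
            at_least_as_extreme += 1
    # add-one smoothing, as commonly done for randomization tests
    return (at_least_as_extreme + 1) / (trials + 1)
```

Identical runs yield a p-value of 1.0, while consistently divergent runs yield a p-value close to zero, supporting a claim that one system genuinely outperforms the other.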
For both sub-tracks the primary de-identification metrics used will consist of standard measures from the NLP community, namely micro-averaged precision, recall, and balanced F-score:
Precision (P) = true positives/(true positives + false positives)
Recall (R) = true positives/(true positives + false negatives)
F-score (F1) = 2*((P*R)/(P+R))
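The three formulas above translate directly into code; the sketch below (function name and counts are illustrative) computes micro-averaged scores from aggregate counts over all documents:

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged precision, recall and balanced F-score from
    aggregate true positive / false positive / false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. 80 correctly detected PHI entities, 20 spurious, 20 missed
p, r, f1 = micro_prf(80, 20, 20)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.8 0.8
```

Micro-averaging pools the counts across all documents and entity types before computing the ratios, so frequent PHI types weigh more heavily than rare ones.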
For both sub-tracks, the official evaluation and the ranking of the submitted systems will be based exclusively on the F-score (F1) measure (labeled as "F1").
Moreover, there will also be sub-track-specific evaluation metrics. For the first sub-track, the leak score, previously proposed for the i2b2 challenges, will be computed. It relates to the detection of leaks (non-redacted PHI remaining after de-identification) and is defined as (# false negatives / # sentences present). In the case of the second
In terms of participating submissions, we will allow up to 5 runs per registered team. Submissions have to be provided in a predefined prediction format (brat or i2b2) and returned to the track organizers before the test set submission deadline (end of May).
See evaluation examples here.