In this section we give some examples to make the evaluation procedure clearer. For the following examples, we will consider that this set of MEDDOCAN tags is our Gold Standard (GS):

Example of tags in a Gold Standard file

This GS file is in i2b2 format. We have a DATE entity, a CONTACT entity, four LOCATION entities, and a NAME entity. Each entity tag is composed of an ID, the START and END offsets, the text snippet between those offsets, the entity TYPE, and a COMMENT.

NOTE: The ID, TEXT, and COMMENT fields are not used by any of the evaluation metrics. The number in the ID field and the content of the COMMENT field are arbitrary, and the TEXT field is evaluated implicitly through the offsets, since the text is the same for the GS and the system files.
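As an illustration of which fields actually matter, here is a minimal Python sketch of how the annotations could be loaded for scoring. The element and attribute names (a TAGS container with start, end, and TYPE attributes) are assumptions about the i2b2-style layout described above; only the start, end, and TYPE values are kept.

```python
import xml.etree.ElementTree as ET

def read_annotations(path):
    """Collect (start, end, TYPE) triples from an i2b2-style XML file.

    Only these three values feed the metrics; the id, text, and comment
    attributes are deliberately ignored, as explained above. The <TAGS>
    container and the attribute names are assumptions about the layout.
    """
    root = ET.parse(path).getroot()
    annotations = set()
    for tag in root.find("TAGS"):
        annotations.add((int(tag.get("start")),
                         int(tag.get("end")),
                         tag.get("TYPE")))
    return annotations
```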

Sub-track 1: NER offset

In this sub-track the goal is to match exactly the start and end offsets of each PHI entity tag and to detect the annotation type correctly. The following system annotations will be accepted by the evaluation script even if the numbers in the ID fields differ and additional comments are included.

Example of tags in a system submission file

For this example the scores obtained by this system are the following:

Precision has been computed by dividing the true positives (4) by the sum of true positives and false positives (0), scoring 4/(4+0) = 1.0. Recall has been computed by dividing the true positives by the sum of true positives and false negatives (3: the tags with IDs T1, T5, and T7 in the GS), scoring 4/(4+3) = 0.5714. Finally, F1 is computed from precision and recall as 2*((1*0.5714)/(1+0.5714)) = 2*(0.5714/1.5714) = 0.7273.

NOTE: This is just an example. We are aware that achieving a precision score of 1.0 is quite a difficult task.
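To make the arithmetic above concrete, here is a minimal sketch of the score computation over (start, end, TYPE) triples. The annotations are hypothetical placeholders, not the actual GS content, chosen only to reproduce the counts of the example (7 GS entities, 4 exact system matches, no spurious ones).

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical (start, end, TYPE) triples reproducing the example counts.
gold = {(0, 10, "DATE"), (20, 29, "CONTACT"), (35, 43, "LOCATION"),
        (50, 58, "LOCATION"), (60, 70, "LOCATION"), (72, 80, "LOCATION"),
        (85, 100, "NAME")}                                  # 7 GS entities
system = {(0, 10, "DATE"), (20, 29, "CONTACT"),
          (35, 43, "LOCATION"), (85, 100, "NAME")}          # 4 exact matches

tp = len(gold & system)       # 4
fp = len(system - gold)       # 0
fn = len(gold - system)       # 3
print(prf(tp, fp, fn))        # (1.0, 0.5714..., 0.7272...)
```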

Sub-track 2: Sensitive span detection

For this second sub-track the goal is only to identify sensitive data. Thus, we will consider a span-based evaluation, regardless of the actual entity type or of the exact offset identification of multi-token sensitive phrase mentions.

Strict span evaluation

For the strict evaluation metric, the evaluation script will accept as correct those annotations in the submission that exactly match the start and end offsets of the annotations in the GS file. For instance, it will accept as correct these tags:

Example of tags in a system submission file

But also the following ones, even if the entity tag and type are not specified (the system has submitted PHI and OTHER as ENTITY and TYPE, respectively):

Example of a system submission file with unspecified tags
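A minimal sketch of the idea, assuming the same (start, end, TYPE) representation as above: the strict span metric simply drops the TYPE before comparing, so a span submitted as PHI/OTHER still counts as correct when its offsets match.

```python
def strict_span_scores(gold, system):
    """Precision/recall/F1 where a hit only requires identical (start, end)
    offsets; the submitted tag name or TYPE value is not considered."""
    gold_spans = {(s, e) for s, e, _ in gold}
    system_spans = {(s, e) for s, e, _ in system}
    tp = len(gold_spans & system_spans)
    precision = tp / len(system_spans) if system_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical offsets: the GS says NAME, the system says OTHER, but the
# span still counts as a true positive under the strict span metric.
print(strict_span_scores({(85, 100, "NAME")}, {(85, 100, "OTHER")}))  # (1.0, 1.0, 1.0)
```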

Merged spans evaluation

For this sub-track we will additionally compute another evaluation in which we merge PHI spans connected by non-alphanumeric characters. For instance, consider this tag in a GS file:

Example of merged span evaluation: GS tag

For the previous tag, in the merged spans evaluation, the script will accept as correct the following annotation (this example is also accepted by the strict span evaluation metric):

Example of merged span evaluation: system annotation with an unspecified tag

But also the following one, because the script merges the spans of the annotations if there are only non-alphanumeric characters between them:

Example of merged span evaluation: system annotations with unspecified tags matching the GS annotation using merging of spans

This merging process is carried out both in the GS file and in the system submission file. Therefore, the result is symmetric even when the larger span is the one given by the system (the “Navarro Cuéllar” and “Ignacio” spans in the GS file vs. the single “Navarro Cuéllar, Ignacio” span in the system file).
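The following Python sketch illustrates this merging step under simple assumptions (character offsets into the document text, with a word character taken as "alphanumeric"). It is not the official script, but it shows why the two GS spans and the single system span above end up being compared as the same merged span; the offsets used are hypothetical.

```python
import re

def merge_spans(spans, text):
    """Merge sorted (start, end) spans whenever only non-alphanumeric
    characters (spaces, commas, hyphens, ...) separate them in the text."""
    merged = []
    for start, end in sorted(spans):
        if merged and not re.search(r"\w", text[merged[-1][1]:start]):
            merged[-1] = (merged[-1][0], end)   # extend the previous span
        else:
            merged.append((start, end))
    return merged

# Hypothetical offsets into a snippet of document text: the GS annotates
# "Navarro Cuéllar" and "Ignacio" separately, the system returns one span.
text = "... Navarro Cuéllar, Ignacio ..."
gs_spans = [(4, 19), (21, 28)]
system_spans = [(4, 28)]
print(merge_spans(gs_spans, text) == merge_spans(system_spans, text))  # True
```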