Evaluation of automatic predictions for the PharmaCoNER task will cover two different scenarios or sub-tracks: the NER offset and entity type classification sub-track, and the concept indexing sub-track:
- NER offset and entity type classification: The first evaluation scenario consists of the classical entity-based or instance-based evaluation, which requires that system outputs match the gold standard annotations exactly in both the beginning and end offsets of each entity mention and in its entity type.
- Concept indexing: The second evaluation scenario consists of a concept indexing task: for each document, participating teams have to generate the list of unique SNOMED concept identifiers, which will be compared to the manually annotated concept identifiers corresponding to chemical compounds and pharmacological substances. A sketch of how matches are counted in both sub-tracks is given below.
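To make the matching criteria concrete, the following is a minimal sketch in Python of how true positive, false positive, and false negative counts could be derived in each sub-track. All function names, variable names, and example values are illustrative; this is not the official evaluation script.

```python
def ner_counts(gold_spans, pred_spans):
    """NER offset and entity type sub-track: a prediction is a true
    positive only if its (start, end, entity_type) tuple matches a
    gold standard annotation exactly."""
    gold = set(gold_spans)   # e.g. {(12, 25, "NORMALIZABLES"), ...} (illustrative type label)
    pred = set(pred_spans)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    return tp, fp, fn

def concept_counts(gold_ids, pred_ids):
    """Concept indexing sub-track: per document, the unique SNOMED
    concept identifiers are compared as sets."""
    gold = set(gold_ids)     # e.g. {"12345678", ...} (hypothetical concept identifiers)
    pred = set(pred_ids)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    return tp, fp, fn
```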
The primary evaluation metrics will consist of micro-averaged precision, recall, and F-score:
Precision (P) = true positives/(true positives + false positives)
Recall (R) = true positives/(true positives + false negatives)
F-score (F1) = 2*((P*R)/(P+R))
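For concreteness, micro-averaging pools the true positive, false positive, and false negative counts over all documents before applying the formulas above. The following sketch (again illustrative, not the official script) computes the three metrics from a list of per-document count triples:

```python
def micro_metrics(counts):
    """Micro-averaging: sum (tp, fp, fn) triples over all documents,
    then compute precision, recall, and F1 from the pooled counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```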
For both sub-tracks, the official evaluation and the ranking of the submitted systems will be based exclusively on the F-score (F1) measure.
As part of the evaluation process, we plan to carry out statistical significance testing between system runs using approximate randomization, following the setting previously used in the context of the i2b2 challenges. The evaluation scripts, together with documentation and README files with instructions, will be made freely available on GitHub so that participating teams can test the evaluation tools locally.
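The following is a minimal sketch of such an approximate randomization test on the F1 difference between two systems, reusing the micro_metrics helper from the sketch above. The number of shuffles and other details of the official test setup may differ:

```python
import random

def approximate_randomization(counts_a, counts_b, trials=9999, seed=0):
    """Two-sided approximate randomization test on the F1 difference
    between systems A and B, where counts_a and counts_b are
    per-document (tp, fp, fn) triples aligned by document."""
    rng = random.Random(seed)
    observed = abs(micro_metrics(counts_a)[2] - micro_metrics(counts_b)[2])
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for ca, cb in zip(counts_a, counts_b):
            # Swap the two systems' outputs for this document with p = 0.5.
            if rng.random() < 0.5:
                ca, cb = cb, ca
            shuffled_a.append(ca)
            shuffled_b.append(cb)
        diff = abs(micro_metrics(shuffled_a)[2] - micro_metrics(shuffled_b)[2])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing, as is conventional for randomization tests.
    return (at_least_as_extreme + 1) / (trials + 1)
```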
As a prediction baseline, we will use vocabulary transfer: named entity mentions derived from the training/development sets are applied to the test set corpus by gazetteer lookup.
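A minimal sketch of such a gazetteer-lookup baseline, assuming a simple case-insensitive longest-match search over the test text (the actual baseline implementation may differ):

```python
import re

def build_gazetteer(train_annotations):
    """Collect the surface forms of all entity mentions annotated in
    the training/development sets, keeping the annotated type of each."""
    gazetteer = {}
    for mention, entity_type in train_annotations:
        gazetteer[mention.lower()] = entity_type
    return gazetteer

def gazetteer_lookup(text, gazetteer):
    """Tag occurrences of gazetteer entries in the test text, trying
    longer entries first so that overlapping shorter matches (e.g.
    'acido' inside 'acido acetilsalicilico') are discarded."""
    covered = set()
    predictions = []
    lowered = text.lower()
    for mention in sorted(gazetteer, key=len, reverse=True):
        for m in re.finditer(re.escape(mention), lowered):
            span = range(m.start(), m.end())
            if covered.isdisjoint(span):
                covered.update(span)
                predictions.append((m.start(), m.end(), gazetteer[mention]))
    return predictions
```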
See evaluation examples here.