Team (BARR2 sub-track 1) | Run ID | Precision (%) | Recall (%) | F1-score (%) |
---|---|---|---|---|
Fsanchez | 1 | 88.61 | 88.23 | 88.42 |
Vicomtech | regex+ML+pat | 88.29 | 76.05 | 81.71 |
Vicomtech | ML+regex | 88.56 | 74.79 | 81.09 |
Vicomtech | ML+pat | 87.34 | 75.63 | 81.08 |
Vicomtech | ML | 88.12 | 74.79 | 80.91 |
UNED | 3 | 85.36 | 73.53 | 79.01 |
UNED | 4 | 83.98 | 72.69 | 77.93 |
UNED | 5 | 84.84 | 70.59 | 77.06 |
UNED | 1 | 85.13 | 69.75 | 76.67 |
UNED | 2 | 91.20 | 47.90 | 62.81 |
Team (BARR2 sub-track 2) | Run ID | Precision (%) | Recall (%) | F1-score (%) | F1 strict (%) | F1 canonical (%) |
---|---|---|---|---|---|---|
Fsanchez | 1 | 85.34 | 81.11 | 83.17 | 79.85 | 82.89 |
Hospital-italiano | 3ul | 88.90 | 71.29 | 79.13 | 77.36 | 79.67 |
Hospital-italiano | 4ul | 87.08 | 71.64 | 78.61 | 76.80 | 79.73 |
Hospital-italiano | 4ul-sinpl | 86.95 | 71.54 | 78.50 | 76.71 | 79.73 |
Vicomtech | ML | 87.57 | 70.20 | 77.93 | 75.65 | 79.30 |
Vicomtech | MLRF | 86.41 | 70.44 | 77.61 | 75.51 | 79.09 |
Vicomtech | ML+regex | 81.58 | 73.36 | 77.25 | 74.88 | 78.80 |
Vicomtech | MLRF+regex | 81.72 | 72.89 | 77.05 | 74.68 | 78.58 |
Hospital-italiano | 3ul-hiba | 83.77 | 69.39 | 75.90 | 73.20 | 75.19 |
Hospital-italiano | 4ul-hiba | 75.97 | 67.85 | 71.68 | 68.87 | 71.91 |
UC3M | 2 | 74.93 | 37.69 | 50.16 | 46.82 | 50.72 |
UC3M | 3 | 74.92 | 37.69 | 50.15 | 46.82 | 50.72 |
UNED | 3 | 41.80 | 24.24 | 30.69 | 28.15 | 29.12 |
UC3M | 1 | 59.80 | 19.43 | 29.33 | 26.18 | 30.43 |
UNED | 1 | 38.88 | 22.54 | 28.54 | 25.81 | 27.04 |
UNED | 2 | 36.92 | 21.41 | 27.10 | 24.78 | 26.44 |
The evaluation metric used for BARR2 is the micro-averaged F-measure over explicit occurrences of abbreviation-definition pairs. The F-measure is the harmonic mean of precision and recall:
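$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}
$$

where true positives (TP), false positives (FP) and false negatives (FN) are counted over all abbreviation-definition pairs in the test set (micro-averaging), rather than averaged per document.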
Individual scores for each abbreviation-definition pair are computed differently for sub-tracks 1 and 2.
Up to 5 runs per team are allowed for each of the two BARR2 sub-tracks.
For sub-track 1, the evaluation returns 1 for each abbreviation-definition pair if all the fields match (#Document_ID, Mention_A_Type, A_Start, A_End, Ment_A, Relation_type, etc.), and 0 otherwise. The abbreviation-definition pairs must follow one of the following orders: short-long, short-nested, or nested-long. Reversing these orders causes the occurrence to be scored as 0.
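As a rough illustration of this all-or-nothing scoring (a sketch only, not the official evaluator; the field names follow the list above and the data model is assumed), a single predicted pair could be scored like this:

```java
// Sketch only: exact-match scoring of one predicted pair for sub-track 1.
// Field names mirror the list above; the official script's data model may differ.
import java.util.Objects;

public class ExactMatchScorer {

    // Hypothetical container for one abbreviation-definition pair.
    record Pair(String documentId, String mentionAType, int aStart, int aEnd,
                String mentA, String relationType) {}

    // 1 only if every field matches the gold annotation, 0 otherwise.
    static int score(Pair predicted, Pair gold) {
        return Objects.equals(predicted, gold) ? 1 : 0;
    }

    public static void main(String[] args) {
        Pair gold = new Pair("doc_001", "SHORT", 120, 123, "TAC", "short-long");
        Pair pred = new Pair("doc_001", "SHORT", 120, 123, "TAC", "short-long");
        System.out.println(score(pred, gold)); // prints 1: all fields match
    }
}
```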
For sub-track 2, predicted definitions are compared to the manual (gold) definitions by automatically comparing the tokens of both definitions, ignoring case and hyphenation.
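A minimal sketch of such a token-level comparison, assuming hyphens are treated as token separators (one possible reading of the rule above; the official script may normalize differently):

```java
// Sketch only: token-level comparison of a predicted and a gold definition,
// ignoring case and hyphens. Hyphens are treated here as whitespace, which is
// an assumption about how "hyphenated characters" are ignored.
import java.util.Arrays;
import java.util.List;

public class DefinitionComparison {

    // Lowercase, turn hyphens into spaces, and split on whitespace.
    static List<String> normalize(String definition) {
        return Arrays.asList(
                definition.toLowerCase().replace("-", " ").trim().split("\\s+"));
    }

    static boolean sameDefinition(String predicted, String gold) {
        return normalize(predicted).equals(normalize(gold));
    }

    public static void main(String[] args) {
        System.out.println(sameDefinition(
                "Tomografía Axial Computarizada",
                "tomografía axial-computarizada")); // prints true
    }
}
```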
For each abbreviation-definition pair in the prediction file, there must be a corresponding line in the Gold Standard with exactly the same Document_ID, StartOffset, EndOffset and Abbrev; otherwise, the evaluation script returns 0 for that abbreviation-definition pair. After this check, the evaluation script runs three different evaluations, reflected in the F1-score, F1 strict and F1 canonical columns of the results table above.
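The initial check can be pictured as a lookup keyed on those four fields; the sketch below uses an assumed data model and is not the official script:

```java
// Sketch only: looking up the gold definition by (Document_ID, StartOffset,
// EndOffset, Abbrev) before any definition-level comparison is attempted.
import java.util.HashMap;
import java.util.Map;

public class GoldLookup {

    // Hypothetical composite key; the official script's data model may differ.
    record Key(String documentId, int startOffset, int endOffset, String abbrev) {}

    public static void main(String[] args) {
        Map<Key, String> gold = new HashMap<>();
        gold.put(new Key("doc_001", 120, 123, "TAC"), "tomografía axial computarizada");

        Key predicted = new Key("doc_001", 120, 123, "TAC");
        String goldDefinition = gold.get(predicted);

        if (goldDefinition == null) {
            System.out.println(0); // no matching gold line: the pair scores 0
        } else {
            System.out.println("compare definitions against: " + goldDefinition);
        }
    }
}
```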
Updated: 2018/06/21
Participants can download the evaluation script from this link.
This README.md file explains how the evaluation works. Please read it first to learn how to use the evaluator.
To evaluate sub-track 2, a Spanish stop words file is needed. Participants may use this file or their own custom stop words file.
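The stop word list is presumably used to discard uninformative tokens before definitions are compared (this use is an assumption, not stated here). Loading and applying such a list could look like the following sketch, where the filename is a placeholder:

```java
// Sketch only: loading a stop word list (one word per line) and filtering it
// out of a tokenized definition. The filename below is a placeholder, and the
// use of stop words during comparison is an assumption.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {

    public static void main(String[] args) throws IOException {
        Set<String> stopWords = new HashSet<>(
                Files.readAllLines(Paths.get("spanish_stopwords.txt")));

        List<String> tokens = List.of("síndrome", "de", "dificultad", "respiratoria");
        List<String> filtered = tokens.stream()
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());

        // With a typical Spanish list containing "de", this prints
        // [síndrome, dificultad, respiratoria]
        System.out.println(filtered);
    }
}
```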
Source code is available at this link. The code is written in Java.
Ander Intxaurrondo
ander.intxaurrondo[AT]bsc.es