Biomedical Abbreviation Recognition and Resolution 2nd Edition (BARR2)

IberEval 2018 | SEPLN 2018

18 September 2018. Seville, Spain

Results

Sub-track 1

Team Run ID Precision Recall F1-score
Fsanchez 1 88.61 88.23 88.42
Vicomtech regex+ML+pat 88.29 76.05 81.71
Vicomtech ML+regex 88.56 74.79 81.09
Vicomtech ML+pat 87.34 75.63 81.08
Vicomtech ML 88.12 74.79 80.91
UNED 3 85.36 73.53 79.01
UNED 4 83.98 72.69 77.93
UNED 5 84.84 70.59 77.06
UNED 1 85.13 69.75 76.67
UNED 2 91.20 47.90 62.81

Sub-track 2

Team Run ID Precision Recall F1-score F1-strict F1-canonical
Fsanchez 1 85.34 81.11 83.17 79.85 82.89
Hospital-italiano 3ul 88.90 71.29 79.13 77.36 79.67
Hospital-italiano 4ul 87.08 71.64 78.61 76.80 79.73
Hospital-italiano 4ul-sinpl 86.95 71.54 78.50 76.71 79.73
Vicomtech ML 87.57 70.20 77.93 75.65 79.30
Vicomtech MLRF 86.41 70.44 77.61 75.51 79.09
Vicomtech ML+regex 81.58 73.36 77.25 74.88 78.80
Vicomtech MLRF+regex 81.72 72.89 77.05 74.68 78.58
Hospital-italiano 3ul-hiba 83.77 69.39 75.90 73.20 75.19
Hospital-italiano 4ul-hiba 75.97 67.85 71.68 68.87 71.91
UC3M 2 74.93 37.69 50.16 46.82 50.72
UC3M 3 74.92 37.69 50.15 46.82 50.72
UNED 3 41.80 24.24 30.69 28.15 29.12
UC3M 1 59.80 19.43 29.33 26.18 30.43
UNED 1 38.88 22.54 28.54 25.81 27.04
UNED 2 36.92 21.41 27.10 24.78 26.44

Evaluation

The evaluation metric used for BARR2 is the micro-averaged F-measure over explicit occurrences of abbreviation-definition pairs. The F-measure is the harmonic mean of precision and recall:

  • Recall is computed as the number of correctly detected occurrences divided by the total number of occurrences in the Gold Standard file.
  • Precision is computed as the number of correctly detected occurrences divided by the total number of occurrences in the prediction file (see the sketch after this list).
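
A minimal sketch of this micro-averaged computation, assuming plain occurrence counts as input (the class and method names are illustrative and not taken from the official evaluation script):

  // Minimal sketch of the micro-averaged metrics described above.
  // correct   = correctly detected occurrences
  // goldTotal = total occurrences in the Gold Standard file
  // predTotal = total occurrences in the prediction file
  public final class MicroAverage {

      public static double precision(int correct, int predTotal) {
          return predTotal == 0 ? 0.0 : (double) correct / predTotal;
      }

      public static double recall(int correct, int goldTotal) {
          return goldTotal == 0 ? 0.0 : (double) correct / goldTotal;
      }

      public static double f1(double precision, double recall) {
          // Harmonic mean of precision and recall.
          return (precision + recall) == 0.0
                  ? 0.0
                  : 2.0 * precision * recall / (precision + recall);
      }

      public static void main(String[] args) {
          double p = precision(880, 1000);  // e.g. 880 correct out of 1000 predicted occurrences
          double r = recall(880, 1200);     // e.g. 880 correct out of 1200 gold occurrences
          System.out.printf("P=%.4f R=%.4f F1=%.4f%n", p, r, f1(p, r));
      }
  }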

Individual scores for each abbreviation-definition pair are computed in different ways for Tasks 1 and 2.

A total of 5 runs are allowed per team for each of the two BARR2 subtasks.


Task 1

For each abbreviation-definition pair, the evaluation script returns 1 if all the fields match (#Document_ID, Mention_A_Type, A_Start, A_End, Ment_A, Relation_type, etc.), and 0 otherwise. The abbreviation-definition pairs must follow one of the following orders: short-long, short-nested, or nested-long. Reversing these orders causes the occurrence to be scored as 0.
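
As an illustration only (the field names follow the task description, this is not the official evaluation script, and any further fields covered by the "etc." above are omitted), the Task 1 scoring amounts to an all-fields comparison:

  import java.util.Objects;

  // Illustrative sketch of the Task 1 scoring: an occurrence counts as 1 only
  // if every annotated field is identical to the Gold Standard entry.
  record Occurrence(String documentId, String mentionAType, int aStart, int aEnd,
                    String mentA, String relationType) {}

  final class Task1Scorer {
      static int score(Occurrence predicted, Occurrence gold) {
          // Records compare all their components in equals(), so a single
          // equality test covers the all-fields-must-match rule.
          return Objects.equals(predicted, gold) ? 1 : 0;
      }
  }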


Task 2

In this task, predicted definitions are compared to the manual definitions by automatically comparing the tokens of both definitions (predicted and manual), ignoring case and hyphens.
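
One possible reading of this normalization step, sketched below with placeholder names (not the official implementation), is to lower-case each definition and strip hyphens before splitting it into tokens:

  import java.util.Arrays;
  import java.util.List;

  // Illustrative normalization only: lower-case the definition and replace
  // hyphens with spaces before splitting it into whitespace-separated tokens.
  final class DefinitionNormalizer {
      static List<String> tokens(String definition) {
          String normalized = definition.toLowerCase().replace("-", " ");
          return Arrays.asList(normalized.trim().split("\\s+"));
      }
  }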

For each abbreviation-definition pair in the prediction file, there must be a corresponding line in the Gold Standard with exactly the same Document_ID, StartOffset, EndOffset and Abbrev. Otherwise, the evaluation script returns 0 for this abbreviation-definition pair. After this check, the evaluation script runs 3 different evaluations:

  • Ultra-strict evaluation: The evaluation script returns 1 if the predicted definition and the gold definition match completely. If not, the script lemmatizes both definitions and returns 1 if they match completely, and 0 otherwise. Token order is relevant in this evaluation: the script returns 0 if the order differs, even if the two definitions share the same tokens. This evaluation is not relevant to the task results.
  • Strict evaluation: The evaluation script tokenizes the predicted definition and the gold definition and removes stop words from both. It then returns 1 if the two definitions match, regardless of token order. If not, the script lemmatizes both definitions and returns 1 if they match, again regardless of token order, and 0 otherwise.
  • Flexible evaluation: This evaluation returns a score between 0 and 1 for each abbreviation-definition pair. The evaluation script removes stop words from the predicted and gold definitions and generates tokenized and lemmatized versions of both. It then computes the number of tokens/lemmas in the predicted definition that are present in the gold definition, divided by the larger number of tokens/lemmas of the two definitions, and returns the higher of the two resulting scores (token-based or lemma-based); see the sketch after this list.
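
The flexible score can be sketched as a token-overlap ratio, as below; stop-word removal, tokenization and lemmatization are assumed to have been applied already, and the class and method names are illustrative rather than those of the official script:

  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  // Simplified sketch of the flexible evaluation: overlap between the predicted
  // and gold definitions, normalized by the longer of the two token lists.
  final class FlexibleEval {

      static double overlapScore(List<String> predicted, List<String> gold) {
          Set<String> goldSet = new HashSet<>(gold);
          long shared = predicted.stream().filter(goldSet::contains).count();
          int longest = Math.max(predicted.size(), gold.size());
          return longest == 0 ? 0.0 : (double) shared / longest;
      }

      static double score(List<String> predTokens, List<String> goldTokens,
                          List<String> predLemmas, List<String> goldLemmas) {
          // Keep the higher of the token-based and lemma-based scores.
          return Math.max(overlapScore(predTokens, goldTokens),
                          overlapScore(predLemmas, goldLemmas));
      }
  }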

Evaluation script

Updated: 2018/06/21

Participants can download the evaluation script from this link.

This README.md file explains how the evaluation works. Please read it first to learn how to use the evaluator.

To evaluate sub-track 2, a Spanish stop words file is needed. Participants may use this file or their own custom stop words file.

Source code is available at this link. The code is originally written in Java.

Contact

Ander Intxaurrondo
ander.intxaurrondo[AT]bsc.es

News

  • June 21st, 2018: BARR2 final test set clinical cases revealed.
  • May 28th, 2018: BARR2 background and test sets released.
  • May 25th, 2018: BARR2 development set released.
  • May 18th, 2018: BARR2 evaluation script released.
  • May 17th, 2018: BARR2 training set released.
  • May 8th, 2018: BARR2 announcement at MultilingualBIO Workshop (LREC 2018).
  • April 20th, 2018: BARR2 sample set released.
  • March 15th, 2018: BARR2 track website launched.