Biomedical Abbreviation Recognition and Resolution 2nd Edition (BARR2)

IberEval 2018 | SEPLN 2018

18 September 2018. Seville, Spain

Results

Sub-track 1

Team Run ID Precision Recall F1-score
Fsanchez 1 88.61 88.23 88.42
Vicomtech regex+ML+pat 88.29 76.05 81.71
Vicomtech ML+regex 88.56 74.79 81.09
Vicomtech ML+pat 87.34 75.63 81.08
Vicomtech ML 88.12 74.79 80.91
UNED 3 85.36 73.53 79.01
UNED 4 83.98 72.69 77.93
UNED 5 84.84 70.59 77.06
UNED 1 85.13 69.75 76.67
UNED 2 91.20 47.90 62.81

Sub-track 2

Team Run ID Precision Recall F1-score F1-strict F1-canonical
Fsanchez 1 85.34 81.11 83.17 79.85 82.89
Hospital-italiano 3ul 88.90 71.29 79.13 77.36 79.67
Hospital-italiano 4ul 87.08 71.64 78.61 76.80 79.73
Hospital-italiano 4ul-sinpl 86.95 71.54 78.50 76.71 79.73
Vicomtech ML 87.57 70.20 77.93 75.65 79.30
Vicomtech MLRF 86.41 70.44 77.61 75.51 79.09
Vicomtech ML+regex 81.58 73.36 77.25 74.88 78.80
Vicomtech MLRF+regex 81.72 72.89 77.05 74.68 78.58
Hospital-italiano 3ul-hiba 83.77 69.39 75.90 73.20 75.19
Hospital-italiano 4ul-hiba 75.97 67.85 71.68 68.87 71.91
UC3M 2 74.93 37.69 50.16 46.82 50.72
UC3M 3 74.92 37.69 50.15 46.82 50.72
UNED 3 41.80 24.24 30.69 28.15 29.12
UC3M 1 59.80 19.43 29.33 26.18 30.43
UNED 1 38.88 22.54 28.54 25.81 27.04
UNED 2 36.92 21.41 27.10 24.78 26.44

Evaluation

The evaluation metric used for BARR2 is the micro-averaged F-measure over explicit occurrences of abbreviation-definition pairs. The F-measure is the harmonic mean of precision and recall:

  • Recall is computed as the number of correctly detected occurrences divided by the total number of occurrences in the Gold Standard file.
  • Precision is computed as the number of correctly detected occurrences divided by the total number of occurrences in the prediction file (see the sketch after this list).
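
A minimal sketch of this micro-averaged computation, assuming plain occurrence counts as input (the class and method names are illustrative and not taken from the official evaluation script):

  // Minimal sketch of the micro-averaged metrics described above.
  // correct   = correctly detected occurrences
  // goldTotal = total occurrences in the Gold Standard file
  // predTotal = total occurrences in the prediction file
  public final class MicroAverage {

      public static double precision(int correct, int predTotal) {
          return predTotal == 0 ? 0.0 : (double) correct / predTotal;
      }

      public static double recall(int correct, int goldTotal) {
          return goldTotal == 0 ? 0.0 : (double) correct / goldTotal;
      }

      public static double f1(double precision, double recall) {
          // Harmonic mean of precision and recall.
          return (precision + recall) == 0.0
                  ? 0.0
                  : 2.0 * precision * recall / (precision + recall);
      }

      public static void main(String[] args) {
          double p = precision(880, 1000);  // e.g. 880 correct out of 1000 predicted occurrences
          double r = recall(880, 1200);     // e.g. 880 correct out of 1200 gold occurrences
          System.out.printf("P=%.4f R=%.4f F1=%.4f%n", p, r, f1(p, r));
      }
  }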

Individual scores for each abbreviation-definition pair are computed in different ways for Tasks 1 and 2.

A total of 5 runs are allowed per team for each of the two BARR2 subtasks.


Task 1

For each abbreviation-definition pair, the evaluation script returns 1 if all the fields match (#Document_ID, Mention_A_Type, A_Start, A_End, Ment_A, Relation_type, etc.), and 0 otherwise. The abbreviation-definition pairs must follow one of the following orders: short-long, short-nested, or nested-long. Reversing these orders causes the occurrence to be scored as 0.
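
As an illustration only (the field names follow the task description, this is not the official evaluation script, and any further fields covered by the "etc." above are omitted), the Task 1 scoring amounts to an all-fields comparison:

  import java.util.Objects;

  // Illustrative sketch of the Task 1 scoring: an occurrence counts as 1 only
  // if every annotated field is identical to the Gold Standard entry.
  record Occurrence(String documentId, String mentionAType, int aStart, int aEnd,
                    String mentA, String relationType) {}

  final class Task1Scorer {
      static int score(Occurrence predicted, Occurrence gold) {
          // Records compare all their components in equals(), so a single
          // equality test covers the all-fields-must-match rule.
          return Objects.equals(predicted, gold) ? 1 : 0;
      }
  }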


Task 2

In this task, predicted definitions are compared to the manual definitions by automatically comparing the tokens of both definitions (predicted and manual), ignoring case and hyphens.
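
One possible reading of this normalization step, sketched below with placeholder names (not the official implementation), is to lower-case each definition and strip hyphens before splitting it into tokens:

  import java.util.Arrays;
  import java.util.List;

  // Illustrative normalization only: lower-case the definition and replace
  // hyphens with spaces before splitting it into whitespace-separated tokens.
  final class DefinitionNormalizer {
      static List<String> tokens(String definition) {
          String normalized = definition.toLowerCase().replace("-", " ");
          return Arrays.asList(normalized.trim().split("\\s+"));
      }
  }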

For each abbreviation-definition pair in the prediction file, there must be a corresponding line in the Gold Standard with exactly the same Document_ID, StartOffset, EndOffset and Abbrev. Otherwise, the evaluation script returns 0 for this abbreviation-definition pair. After this check, the evaluation script runs 3 different evaluations:

  • Ultra-strict evaluation: The evaluation script returns 1 if the predicted definition and the gold definition match completely. If not, the script lemmatizes both definitions and returns 1 if they match completely, and 0 otherwise. Token order is relevant in this evaluation: the script returns 0 if the order differs, even if the two definitions share the same tokens. This evaluation is not relevant to the task results.
  • Strict evaluation: The evaluation script tokenizes the predicted definition and the gold definition and removes stop words from both. It then returns 1 if the two definitions match, regardless of token order. If not, the script lemmatizes both definitions and returns 1 if they match, again regardless of token order, and 0 otherwise.
  • Flexible evaluation: This evaluation returns a score between 0 and 1 for each abbreviation-definition pair. The evaluation script removes stop words from the predicted and gold definitions and generates tokenized and lemmatized versions of both. It then computes the number of tokens/lemmas in the predicted definition that are present in the gold definition, divided by the larger number of tokens/lemmas of the two definitions, and returns the higher of the two resulting scores (token-based or lemma-based); see the sketch after this list.
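
The flexible score can be sketched as a token-overlap ratio, as below; stop-word removal, tokenization and lemmatization are assumed to have been applied already, and the class and method names are illustrative rather than those of the official script:

  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  // Simplified sketch of the flexible evaluation: overlap between the predicted
  // and gold definitions, normalized by the longer of the two token lists.
  final class FlexibleEval {

      static double overlapScore(List<String> predicted, List<String> gold) {
          Set<String> goldSet = new HashSet<>(gold);
          long shared = predicted.stream().filter(goldSet::contains).count();
          int longest = Math.max(predicted.size(), gold.size());
          return longest == 0 ? 0.0 : (double) shared / longest;
      }

      static double score(List<String> predTokens, List<String> goldTokens,
                          List<String> predLemmas, List<String> goldLemmas) {
          // Keep the higher of the token-based and lemma-based scores.
          return Math.max(overlapScore(predTokens, goldTokens),
                          overlapScore(predLemmas, goldLemmas));
      }
  }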

Evaluation script

Updated: 2018/06/21

Participants can download the evaluation script from this link.

This README.md file explains how the evaluation works. Please read it first to learn how to use the evaluator.

To evaluate sub-track 2, a Spanish stop words file is needed. Participants may use this file or their own custom stop words file.

Source code is available at this link. The code is originally written in Java.

Contact

Ander Intxaurrondo
ander.intxaurrondo[AT]bsc.es

News

  • June 21st, 2018: BARR2 final test set clinical cases revealed.
  • May 28th, 2018: BARR2 background and test sets released.
  • May 25th, 2018: BARR2 development set released.
  • May 18th, 2018: BARR2 evaluation script released.
  • May 17th, 2018: BARR2 training set released.
  • May 8th, 2018: BARR2 announcement at MultilingualBIO Workshop (LREC 2018).
  • April 20th, 2018: BARR2 sample set released.
  • March 15th, 2018: BARR2 track website launched.