For organizational reasons, the evaluation system has been modified: instead of following the traditional BioASQ method of successive cycles, the competition will be evaluated against a manually annotated data set created specifically for this task.
Evaluation of systems
Participating teams will have to generate, for each document, a list of unique DeCS codes, which will be compared against the manually annotated DeCS codes. The list must be ordered by confidence, with the codes of greater confidence first. The confidence values themselves are not required, only the ordering. The ordering is used solely for a deeper analysis of the systems; the correctness of the order will not be evaluated.
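As an illustration of the required ordering, a system could sort its predicted codes by confidence and output only the ordered list. This is a minimal sketch: the JSON layout, field names, and placeholder codes are assumptions, not the official submission format defined on the task page.

```python
# Hypothetical sketch: turning per-document confidence scores into the
# ordered list of unique DeCS codes described above. The JSON layout,
# field names, and placeholder codes ("CODE_A", ...) are illustrative
# assumptions, not the official submission format.
import json

predictions = {
    "doc-0001": [("CODE_A", 0.91), ("CODE_B", 0.74), ("CODE_C", 0.32)],
}

submission = []
for doc_id, scored_codes in predictions.items():
    # Sort by confidence, highest first.
    ordered = sorted(scored_codes, key=lambda pair: pair[1], reverse=True)
    # Keep each code once, preserving the confidence order; the
    # confidence values themselves are dropped, only the order matters.
    unique_codes = list(dict.fromkeys(code for code, _ in ordered))
    submission.append({"id": doc_id, "labels": unique_codes})

with open("submission.json", "w") as json_file:
    json.dump({"documents": submission}, json_file, indent=2)
```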
Participating systems will be assessed based on a flat measure: the label-based micro F-measure, which is the official evaluation metric of the task.
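For reference, the following is a minimal sketch of how a label-based micro F-measure can be computed over multi-label predictions. It assumes gold and predicted annotations are given as sets of DeCS codes per document; it is not the official evaluation script.

```python
# Minimal sketch of the label-based micro F-measure for multi-label
# predictions. True positives, false positives, and false negatives are
# pooled across all documents before computing precision and recall.
def micro_f_measure(gold, predicted):
    """gold, predicted: dicts mapping document id -> set of DeCS codes."""
    tp = fp = fn = 0
    for doc_id, gold_codes in gold.items():
        pred_codes = predicted.get(doc_id, set())
        tp += len(gold_codes & pred_codes)   # correctly predicted codes
        fp += len(pred_codes - gold_codes)   # spurious codes
        fn += len(gold_codes - pred_codes)   # missed codes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with placeholder codes:
# gold = {"doc-0001": {"CODE_A", "CODE_B"}}
# predicted = {"doc-0001": {"CODE_A"}}
# micro_f_measure(gold, predicted) -> 0.666...
```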
In order to measure the classification performance, the manual evaluation of the systems will be organized as follows:
- 1,000 abstracts will be manually annotated by at least two human experts.
- A cross-validation process between the annotators will identify discrepant annotations.
- In case of discrepancy, a consensus phase will begin in which the annotators must agree on their annotations.
- In the event that LILACS / IBECS publish new indexations during the testing phase, these will be added to the consensus process. In that case, the consensus will take into account the annotations of the two human experts plus the new annotations coming from LILACS / IBECS.
- Participants will have 500 manually annotated abstracts as a Development set (dry run).
- The remaining 500 will comprise the Test set that will be used for the evaluation of the task.
For completeness, we may also provide additional evaluation measures (as in previous BioASQ competitions) that take into account the hierarchical structure of the ontology, but the only official evaluation metric is the label-based flat measure explained above. These additional measures are described in more detail on the BioASQ page (“Evaluation” section).
Generation of a public Silver Standard Corpus
Participants will be asked to automatically annotate 1M heterogeneous documents, called the Background set, which will include the 500 documents of the Test set. This approach ensures that participating teams cannot manually correct their annotations, and also encourages systems that can potentially scale to large data collections.
Additionally, these annotations will generate a public Spanish Silver Standard Corpus of semantically indexed documents.
For more information, visit the BioASQ page.