For organizational reasons, the evaluation system has been modified: instead of following the traditional BioASQ method of successive cycles, the competition will be evaluated against a manually annotated data set created specifically for this task.
Participating teams must generate, for each document in the test set, a list of unique DeCS codes, which will be compared against the manually annotated DeCS codes. The list must be ordered by confidence, with higher-confidence codes first. The confidence values themselves are not required, but the list must be ordered. The ordering is used only for a deeper analysis of the systems: the correctness of the order itself will not be evaluated.
The participating systems will be assessed based on a flat measure: the label-based micro F-measure, which is the official evaluation metric of the task.
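As an illustration of the metric, the following sketch computes a label-based micro F-measure by pooling true positives, false positives, and false negatives across all documents before computing precision and recall (function and variable names are our own, not part of the official evaluation script):

```python
def micro_f1(gold, pred):
    """Label-based micro F-measure.

    gold, pred: lists (one entry per document) of sets of DeCS codes.
    Counts are pooled over the whole collection, so frequent labels
    weigh more than rare ones (unlike macro-averaging).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # codes predicted and annotated
        fp += len(p - g)  # codes predicted but not annotated
        fn += len(g - p)  # codes annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with two documents:
gold = [{"D001", "D002"}, {"D003"}]
pred = [{"D001"}, {"D003", "D004"}]
print(round(micro_f1(gold, pred), 4))  # pooled TP=2, FP=1, FN=1
```

Note that the predicted code lists are treated as sets here, since only the presence of a code (not its rank) affects the metric.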