Extra datasets

Semantic indexing of content is a demanding and complex task. Within a text, there may be presence of some elements, such as mentions of medications or diseases, that could facilitate the prediction of DeCS descriptors by the participants’ models. For example, the Figure shows the mentions of medical procedures and diseases found in the abstract of a COVID-10 scientific manuscript. There are mentions of pneumonia, thrombosis, and acute respiratory distress that could facilitate the prediction of COVID-19 related labels.

The Text Mining Unit of the Barcelona Supercomputing Center has been working on BioMedical NERs for many years. Therefore, in order to provide additional data to improve the performance of your models, we have applied some of our NER models and provided you with the results. Specifically, we have extracted entities related to medications, diseases, symptoms, and medical procedures. We provide this information to the participants as additional data in the “Additional Data” folder of Zenodo. For each training, development and test set there is an additional json file with the structure shown below, where span refers to the character string of the entity found; start is the character number within the text where the entity begins; and end represents the number of the character where the entity ends.

{
  "articles": [
    {
      "id": "Id of the article",
      "diseases": [
		{"span": "this is a disease", "start": "1", "end": "19"},
                {"span": "this is a another disease", "start": "125", "end": "64"}
      ],
      "medications": [],
      "procedures": [],
      "symptoms": []
    }
  ]
}

The following table shows some of the statistics of the entities found in MESINESP2 datasets:

DiseasesMedicationsProceduresSymptoms
Subtrack 1 training
Total mentions
Avg mentions/doc (Std)
Min
Max

708175
2.84 (3.55)
0
47

85913
0.34 (1.49)
0
53

361440
1.43 (2.33)
0
33

127233
0.51 (1.33)
0
34
Subtrack 1 development
Total mentions
Avg mentions/doc (Std)
Min
Max

3576
3.36 (4.11)
0
226

1237
1.16 (2.48)
0
19

1487
1.40 (2.32)
0
18

577
0.54 (1.39)
0
13
Subtrack 2 training
Total mentions
Avg mentions/doc (Std)
Min
Max

125294
34.88 (19.83)
0
146

83279
23.18 (16.25)
0
185

50629
14.09 (9.74)
0
68

9786
2.72 (4.20)
0
46
Subtrack 2 development
Total mentions
Avg mentions/doc (Std)
Min
Max

4338
29.51 (20.74)
0
99

3024
20.57 (18.48)
0
99

1937
13.18 (12.05)
0
64

354
2.41 (4.00)
0
24
Subtrack 3 development
Total mentions
Avg mentions/doc (Std)
Min
Max

171
1.57 (3.49)
0
21

180
1.65 (3.98)
0
30

25
0.23 (0.66)
0
4

12
0.11 (0.52)
0
3

Linguistic Resources:

  • AbreMES-DB: The Spanish Medical Abbreviation DataBase. Abbreviations are extracted from the metadata of different biomedical publications written in Spanish, which contain the titles and abstracts. Download from ZENODO.
  • MEDDOCAN-Gazetteer: Gazetteer of MEDDOCAN related entities. Includes names, surnames, addresses, hospitals, professions, and different types of locations (provinces, cities, towns, etc.). Download it from here.
  • Sentence-splitted test-set : Sentence splitted test set (including background set), computed using SPACCC_POS-TAGGER (see below). These annotations are mandatory to compute the leak score of subtrack 1. Download it from here.
  • SPACCC_POS-TAGGER: Part-of-Speech Tagger for medical domain corpus in Spanish based on FreeLing. Download it from GitHub.