Additional resources – MESINESP2: MEDICAL SEMANTIC INDEXING IN SPANISH

Extra datasets

Semantic indexing of content is a demanding and complex task. A text may contain elements, such as mentions of medications or diseases, that make it easier for participants’ models to predict DeCS descriptors. For example, the Figure shows the mentions of medical procedures and diseases found in the abstract of a COVID-19 scientific manuscript. Mentions of pneumonia, thrombosis, and acute respiratory distress could facilitate the prediction of COVID-19 related labels.

The Text Mining Unit of the Barcelona Supercomputing Center has been working on biomedical named entity recognition (NER) for many years. To provide additional data that may improve the performance of your models, we have applied some of our NER models and made the results available. Specifically, we have extracted entities related to medications, diseases, symptoms, and medical procedures. This information is available to participants in the “Additional Data” folder on Zenodo. For each training, development, and test set there is an additional JSON file with the structure shown below, where span is the character string of the entity found, start is the character offset within the text where the entity begins, and end is the character offset where the entity ends.

{
  "articles": [
    {
      "id": "Id of the article",
      "diseases": [
        {"span": "this is a disease", "start": "1", "end": "19"},
        {"span": "this is another disease", "start": "125", "end": "64"}
      ],
      "medications": [],
      "procedures": [],
      "symptoms": []
    }
  ]
}
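As a minimal sketch of how such a file could be read, the Python snippet below parses the structure above into per-article lists of (span, start, end) tuples. The function name load_entities and the sample article are hypothetical, not part of the distributed data. The offsets are cast to integers because the template shows them as strings. The template does not say whether end is inclusive or exclusive, so the sample values are illustrative only.

```python
import json

# Entity types listed in the JSON structure above.
ENTITY_TYPES = ("diseases", "medications", "procedures", "symptoms")


def load_entities(json_text):
    """Return {article_id: {entity_type: [(span, start, end), ...]}}.

    Hypothetical helper: offsets are converted to int because the
    template shows them as strings.
    """
    data = json.loads(json_text)
    entities = {}
    for article in data["articles"]:
        per_type = {}
        for entity_type in ENTITY_TYPES:
            mentions = article.get(entity_type, [])
            per_type[entity_type] = [
                (m["span"], int(m["start"]), int(m["end"])) for m in mentions
            ]
        entities[article["id"]] = per_type
    return entities


# Made-up sample that follows the documented structure (not real data).
sample = """
{
  "articles": [
    {
      "id": "example-1",
      "diseases": [{"span": "trombosis", "start": "10", "end": "19"}],
      "medications": [],
      "procedures": [],
      "symptoms": [{"span": "fiebre", "start": "30", "end": "36"}]
    }
  ]
}
"""

entities = load_entities(sample)
print(entities["example-1"]["diseases"])
```

For a real file, the text passed to load_entities would be read from the corresponding JSON file in the “Additional Data” folder.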

The following table shows some of the statistics of the entities found in MESINESP2 datasets:

Dataset	Statistic	Diseases	Medications	Procedures	Symptoms
Subtrack 1 training	Total mentions	708175	85913	361440	127233
Subtrack 1 training	Avg mentions/doc (Std)	2.84 (3.55)	0.34 (1.49)	1.43 (2.33)	0.51 (1.33)
Subtrack 1 training	Min	0	0	0	0
Subtrack 1 training	Max	47	53	33	34
Subtrack 1 development	Total mentions	3576	1237	1487	577
Subtrack 1 development	Avg mentions/doc (Std)	3.36 (4.11)	1.16 (2.48)	1.40 (2.32)	0.54 (1.39)
Subtrack 1 development	Min	0	0	0	0
Subtrack 1 development	Max	226	19	18	13
Subtrack 2 training	Total mentions	125294	83279	50629	9786
Subtrack 2 training	Avg mentions/doc (Std)	34.88 (19.83)	23.18 (16.25)	14.09 (9.74)	2.72 (4.20)
Subtrack 2 training	Min	0	0	0	0
Subtrack 2 training	Max	146	185	68	46
Subtrack 2 development	Total mentions	4338	3024	1937	354
Subtrack 2 development	Avg mentions/doc (Std)	29.51 (20.74)	20.57 (18.48)	13.18 (12.05)	2.41 (4.00)
Subtrack 2 development	Min	0	0	0	0
Subtrack 2 development	Max	99	99	64	24
Subtrack 3 development	Total mentions	171	180	25	12
Subtrack 3 development	Avg mentions/doc (Std)	1.57 (3.49)	1.65 (3.98)	0.23 (0.66)	0.11 (0.52)
Subtrack 3 development	Min	0	0	0	0
Subtrack 3 development	Max	21	30	4	3
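Statistics of this kind can be recomputed from the additional JSON files. The sketch below is one possible way to do it, not the organizers' script: mention_stats is a hypothetical helper that takes the parsed "articles" list and returns the total, average, standard deviation, minimum, and maximum number of mentions per document for each entity type. It uses the population standard deviation, which is an assumption, since the table does not say which variant was reported.

```python
import statistics

# Entity types present in each article of the additional JSON files.
ENTITY_TYPES = ("diseases", "medications", "procedures", "symptoms")


def mention_stats(articles):
    """Per-entity-type mention statistics over a non-empty list of articles."""
    stats = {}
    for entity_type in ENTITY_TYPES:
        # Number of mentions of this type in each document.
        counts = [len(article.get(entity_type, [])) for article in articles]
        stats[entity_type] = {
            "total": sum(counts),
            "avg": statistics.mean(counts),
            # Assumption: population standard deviation.
            "std": statistics.pstdev(counts),
            "min": min(counts),
            "max": max(counts),
        }
    return stats


# Toy articles with placeholder mentions (not real data).
mention = {"span": "x", "start": "0", "end": "1"}
toy_articles = [
    {"id": "a", "diseases": [mention, mention], "medications": [],
     "procedures": [], "symptoms": []},
    {"id": "b", "diseases": [], "medications": [mention],
     "procedures": [], "symptoms": []},
    {"id": "c", "diseases": [mention], "medications": [],
     "procedures": [], "symptoms": []},
]

result = mention_stats(toy_articles)
for entity_type, values in result.items():
    print(entity_type, values)
```

If the reported figures were computed with the sample standard deviation instead, statistics.stdev would be the drop-in replacement for statistics.pstdev.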

Linguistic Resources:

AbreMES-DB: The Spanish Medical Abbreviation DataBase. Abbreviations are extracted from the metadata of different biomedical publications written in Spanish, which contain the titles and abstracts. Download from ZENODO.
MEDDOCAN-Gazetteer: Gazetteer of MEDDOCAN related entities. Includes names, surnames, addresses, hospitals, professions, and different types of locations (provinces, cities, towns, etc.). Download it from here.
Sentence-split test set: The test set (including the background set) split into sentences, computed using SPACCC_POS-TAGGER (see below). These annotations are mandatory to compute the leak score of subtrack 1. Download it from here.
SPACCC_POS-TAGGER: Part-of-Speech Tagger for medical domain corpus in Spanish based on FreeLing. Download it from GitHub.