You can download the latest version of the dataset from this Zenodo page. Please make sure you have downloaded the latest version, since we will be adding new resources throughout the shared task.

Training

The training dataset contains records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract and title written in Spanish. The URL used to retrieve the records is the following:

http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&

We excluded records with an empty abstract or title, keeping only those whose XML elements ab_es and ti_es differ from ‘null’ and ‘No disponible’ (‘Not available’). To ensure that all titles and abstracts in the dataset are actually written in Spanish, we verified them with a language detection library (langdetect).
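
Below is a minimal sketch of this filtering step using the langdetect library; the record dictionary and its keys, taken from the XML element names above, are illustrative.

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_valid_spanish_record(record):
    """Keep records whose ti_es and ab_es fields exist and are Spanish."""
    title = record.get("ti_es")
    abstract = record.get("ab_es")
    # Discard empty fields and the placeholder values used by the source.
    for field in (title, abstract):
        if not field or field in ("null", "No disponible"):
            return False
    try:
        # langdetect returns an ISO 639-1 code, e.g. "es" for Spanish.
        return detect(title) == "es" and detect(abstract) == "es"
    except LangDetectException:
        # Raised when a text has no detectable features (e.g. only digits).
        return False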

We built the training dataset with data crawled on 01/29/2021. The data is therefore a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify index terms after a record’s first inclusion in the database.

In this year’s MESINESP, we distribute two different datasets:

  • Articles training set: This corpus contains the set of 237,574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
  • Full training set: This corpus contains the whole set of 249,474 Spanish documents from VHL that have at least one DeCS code assigned to them.

The table below shows some descriptive statistics for these datasets:

              Articles training set             Full training set
              Abstracts’ length   DeCS codes    Abstracts’ length   DeCS codes
Min.          1                   1             1                   1
Mean (Std)    1196.04 (488.19)    8.37 (3.50)   1196.68 (514.63)    8.32 (3.49)
Median        1156                8             1155                8
Max.          9243                37            9787                37

The training datasets are distributed as JSON files with the following structure:

{
  "articles": [
    {
      "id": "Id of the article",
      "title": "Title of the article",
      "abstractText": "Content of the abstract",
      "journal": "Name of the journal",
      "year": 2018,
      "db": "Name of the database",
      "decsCodes": [
        "code1",
        "code2",
        "code3"
      ]
    }
  ]
}
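
As an illustration, the file can be read with a few lines of Python; the filename "training_set.json" is an assumption, since the actual name on Zenodo may differ.

import json

# Load the training file; the filename is assumed for illustration.
with open("training_set.json", encoding="utf-8") as f:
    data = json.load(f)

# Each entry in "articles" carries its metadata and gold DeCS codes.
for article in data["articles"]:
    print(article["id"], article["year"], len(article["decsCodes"]))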

The original XML records do not contain decsCodes but descriptors. This year we converted those descriptors to their corresponding decsCodes using the DeCS 2020 conversion table (a lookup sketch follows the list below), which contains:

  • DeCS Code
  • Preferred descriptor/term (the label used in the International DeCS 2020 set)
  • Synonym terms of the preferred descriptor
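
As a rough sketch of how such a table can be used, the snippet below builds a descriptor-to-code index; the TSV layout, column names, and synonym separator are assumptions, since the actual DeCS 2020 file may be structured differently.

import csv

def build_descriptor_index(path):
    """Map each preferred descriptor and synonym to its DeCS code."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            code = row["decs_code"]                      # assumed column name
            index[row["preferred_term"].lower()] = code  # assumed column name
            # Synonyms are assumed to be pipe-separated in one column.
            for synonym in row["synonyms"].split("|"):
                if synonym:
                    index[synonym.lower()] = code
    return index

# Descriptors found in the XML can then be resolved case-insensitively:
# index = build_descriptor_index("decs2020.tsv")
# code = index.get("descriptor from the xml".lower())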

For more details about the Latin American and European Spanish DeCS codes see http://decs.bvs.br and http://decses.bvsalud.org/, respectively. You can also read the “DeCS Headings” page on this website.

Development

For the scientific literature subtrack we provide a development set manually indexed by expert annotators. This dataset includes 1,065 articles annotated with DeCS codes by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the Inter-Annotator Agreement among their annotations, we selected the 3 best-performing annotators and took their annotations as the valid ones for building the development set. Of those 1,065 records:

  • 213 articles were annotated by more than one annotator. For these, we selected the union of their annotations (see the sketch after this list).
  • 852 articles were annotated by only one of the three selected annotators.
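
For the multiply-annotated articles, the union step amounts to merging the annotators’ code sets, as in this minimal sketch (the codes and annotator names are illustrative):

# DeCS code sets produced by different annotators for the same article.
annotations = {
    "annotator_1": {"D001", "D002"},
    "annotator_2": {"D002", "D003"},
}

# The gold labels for the article are the union of all annotations.
gold_codes = set().union(*annotations.values())
print(sorted(gold_codes))  # ['D001', 'D002', 'D003']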

Some statistics about the development set are shown in the following table:

              Development set
              Abstracts’ length   DeCS codes
Min.          44                  3
Mean (Std)    1323.49 (419.54)    10.59 (4.10)
Median        1332                10
Max.          2825                32

The development set is distributed as JSON with the same format as the training set.

Test

For this subtrack we provide a Test dataset generated as follows:

  • 10,179 abstracts without DeCS codes have been gathered from LILACS/IBECS.
  • 500 of them will be manually annotated by two human experts.
  • A cross-validation process will identify discrepant annotations.
  • In case of discrepancy, a consensus phase will begin in which the two annotators must agree on their annotations.
  • Participants will have to predict the DeCS codes for each abstract in the entire dataset. However, the systems will be evaluated only on the set of 500 expert-annotated abstracts, which will be published as the Gold Standard after the evaluation period ends.

The test set will be served as JSON strings with the following format:

{
  "articles": [
    {
      "id": "Id of the article",
      "title": "Title of the article",
      "abstractText": "Content of the abstract",
      "journal": "Name of the journal",
      "year": year,
      "db": "Name of the database"
    }
  ]
}
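
As a purely illustrative end-to-end sketch, the snippet below reads the test file, runs a placeholder predictor over each abstract, and writes the predicted codes using the same decsCodes structure as the training data; the filenames and the predict_decs_codes() stub are hypothetical, and the official submission format is defined by the task organizers.

import json

def predict_decs_codes(text):
    """Hypothetical stand-in for a participant's model."""
    return []  # a real system would return a list of DeCS codes

# The test filename is assumed for illustration.
with open("test_set.json", encoding="utf-8") as f:
    test_data = json.load(f)

predictions = []
for article in test_data["articles"]:
    text = article["title"] + " " + article["abstractText"]
    predictions.append({
        "id": article["id"],
        "decsCodes": predict_decs_codes(text),
    })

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump({"articles": predictions}, f, ensure_ascii=False, indent=2)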