The Training Dataset

The training dataset contains all records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract written in Spanish. The URL used to retrieve the records is:

http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&
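To make the query easier to read, the URL above can be split into its parameters. The following is only a sketch using the Python standard library; it does not contact the server, and the meaning of each parameter is not documented here, so only the names and values that appear in the URL are shown.

```python
from urllib.parse import parse_qs, urlsplit

# The retrieval URL exactly as given above, split across two string literals.
URL = ("http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC"
       "&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&")


def query_parameters(url):
    """Return the query string of `url` as a dict: parameter name -> list of values."""
    # keep_blank_values=True preserves the empty `q=` parameter.
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


params = query_parameters(URL)
for name, values in params.items():
    print(name, values)
```

The repeated filter[db][] parameter is what selects both databases in a single query.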

To learn more about VHL, LILACS and IBECS, see the About the VHL, LILACS and IBECS section.

To keep only the records with non-empty abstracts, we select those whose XML element ab_es is different from ‘null’ and from ‘No disponible’ (‘Not available’ in Spanish). You can see an XML record from here. To ensure that all abstracts in the data source are really written in Spanish, we used a language guesser (langdetect) to verify this condition. This script identified non-Spanish abstracts in the ‘ab_es’ XML element, and the corresponding 1,797 records were removed from the data set (mostly text in English and Portuguese).
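The two filtering steps can be sketched as follows. This is an illustration under assumptions, not the script that was actually used: records are assumed to be already parsed into dictionaries with an ab_es key, records without that key are treated as empty, and the language guesser is passed in as a function. With the langdetect package installed, its detect function could be passed as detect_language; the toy_detector below is a hypothetical stand-in so the example runs without extra dependencies.

```python
# Values of ab_es that mark a missing abstract, as described above.
EMPTY_MARKERS = {"null", "No disponible"}


def has_abstract(record):
    """True if the record carries an ab_es value that is not an empty marker."""
    abstract = record.get("ab_es")  # assumption: a missing key counts as empty
    return abstract is not None and abstract.strip() not in EMPTY_MARKERS


def keep_spanish(records, detect_language):
    """Keep records whose abstract is present and guessed to be Spanish ('es')."""
    return [r for r in records
            if has_abstract(r) and detect_language(r["ab_es"]) == "es"]


def toy_detector(text):
    """Hypothetical stand-in for a real language guesser; demonstration only."""
    spanish_words = {"el", "la", "de", "los"}
    return "es" if spanish_words & set(text.lower().split()) else "en"


records = [
    {"id": "1", "ab_es": "El estudio de la población"},
    {"id": "2", "ab_es": "null"},
    {"id": "3", "ab_es": "No disponible"},
    {"id": "4", "ab_es": "This study describes results"},
]
kept = keep_spanish(records, toy_detector)
print([r["id"] for r in kept])
```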

The training dataset was crawled on 10/22/2019. The data is therefore a snapshot of that moment and may change over time. In fact, minor changes are very likely, as the different databases that make up LILACS and IBECS may add or modify the indexes.

The final data sets contain 369,368 records from 26,609 different journals. Two data sets are distributed, as described below:

  • Original Train set with 369,368 records that also include the qualifiers, as retrieved from VHL. Download the Original Train set from here.
  • Pre-processed Train set with the 318,658 records with at least one DeCS code and with no qualifiers. Download the Pre-processed Train set from here.

The table below describes the length of the abstracts in the original set, measured in characters:

Abstracts’ length (measured in characters)
Min: 12
Avg: 1140.41
Median: 1094
Max: 9428

The DeCS codes are distributed as follows (for more information about the distribution of DeCS codes in the training set, check here):

Number of DeCS codes per file
Min: 1
Avg: 8.12
Median: 7
Max: 53

The training data sets are distributed as a JSON file with the following format:

{
  "articles": [
    {
      "id": "Id of the article",
      "title": "Title of the article",
      "abstractText": "Content of the abstract",
      "journal": "Name of the journal",
      "year": 2018,
      "db": "Name of the database",
      "decsCodes": [
        "code1",
        "code2",
        "code3"
      ]
    }
  ]
}
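Given a file in this format, summary figures like those in the tables above (abstract length in characters, number of DeCS codes per record) can be recomputed with a few lines of Python. This is a minimal sketch: the sample data is invented for illustration, and it is an assumption that the published figures were computed over all records in exactly this way.

```python
import json
from statistics import mean, median


def summarize(values):
    """Return min, average (2 decimals), median and max of a list of numbers."""
    return {
        "min": min(values),
        "avg": round(mean(values), 2),
        "median": median(values),
        "max": max(values),
    }


def dataset_statistics(dataset):
    """Compute record count, journal count and the two summaries for a parsed data set."""
    articles = dataset["articles"]
    return {
        "records": len(articles),
        "journals": len({a["journal"] for a in articles}),
        "abstract_length": summarize([len(a["abstractText"]) for a in articles]),
        "codes_per_record": summarize([len(a["decsCodes"]) for a in articles]),
    }


# Invented two-record sample that follows the format shown above.
sample = json.loads("""
{"articles": [
  {"id": "a1", "title": "T1", "abstractText": "Resumen uno", "journal": "J1",
   "year": 2018, "db": "LILACS", "decsCodes": ["code1", "code2", "code3"]},
  {"id": "a2", "title": "T2", "abstractText": "Resumen dos largo", "journal": "J2",
   "year": 2017, "db": "IBECS", "decsCodes": ["code1"]}
]}
""")
stats = dataset_statistics(sample)
print(stats)
```

For a real file, replace the inline sample with json.load on the downloaded file.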

Note that the decsCodes field lists the DeCS IDs assigned to a record in the source data. Since the original XML data contain descriptors (not codes), we provide a DeCS conversion table with:

  • DeCS codes
  • Preferred descriptor (the label used in the European DeCS 2019 set)
  • List of synonyms (the descriptors and synonyms from both the European and Latin Spanish DeCS 2019 data sets, separated by pipes)

For more details on the Latin and European Spanish DeCS codes, see http://decs.bvs.br and http://decses.bvsalud.org/ respectively.
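A conversion table with these three columns can be turned into a descriptor-to-code lookup, which is the direction needed to map the descriptors found in the XML data to codes. The sketch below assumes one entry per line with tab-separated columns; the actual file layout is not specified here, so treat the delimiter as an assumption. Only the pipe-separated synonym list comes from the description above. The codes and descriptors in the sample are placeholders.

```python
def build_descriptor_index(lines):
    """Map every preferred descriptor and synonym (lower-cased) to its DeCS code."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: code <TAB> preferred descriptor <TAB> synonyms joined by '|'
        code, preferred, synonyms = line.split("\t")
        terms = [preferred] + [s for s in synonyms.split("|") if s]
        for term in terms:
            # setdefault keeps the first code seen if a term occurs more than once.
            index.setdefault(term.strip().lower(), code)
    return index


# Placeholder rows, not real DeCS entries.
table = [
    "code1\tDescriptor uno\tSinónimo A|Sinónimo B\n",
    "code2\tDescriptor dos\t\n",
]
index = build_descriptor_index(table)
print(index["sinónimo b"])
```

Lower-casing the terms is a design choice of this sketch, made so that lookups do not depend on capitalization.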

Development and Test Data sets

The MESINESP task will provide a Development set and a Test set, generated as follows:

  • 1,000 abstracts will be manually annotated by two human experts.
  • A cross-validation process will identify discrepant annotations.
  • In case of discrepancy, a consensus phase will begin in which the two annotators must agree on their annotations.
  • If LILACS / IBECS publish new indexing during the testing phase, it will be added to the consensus process. In that case, the consensus should take into account the annotations of the two human experts plus the new annotations coming from LILACS / IBECS.
  • Participants will have 500 manually annotated abstracts as a Development set or dry run.
  • The remaining 500 will comprise the Test set that will be used for the evaluation of the task.
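The step that identifies discrepant annotations can be pictured as a set comparison between the codes chosen by the two annotators. The sketch below is only an illustration of that idea, not the procedure used by the organizers; the function name, the input layout (article id mapped to a list of codes) and the sample values are all hypothetical.

```python
def find_discrepancies(annotations_a, annotations_b):
    """Return {article_id: (codes only in A, codes only in B)} where the code sets differ."""
    discrepancies = {}
    # Only articles annotated by both experts are compared.
    for article_id in sorted(annotations_a.keys() & annotations_b.keys()):
        codes_a = set(annotations_a[article_id])
        codes_b = set(annotations_b[article_id])
        if codes_a != codes_b:
            discrepancies[article_id] = (codes_a - codes_b, codes_b - codes_a)
    return discrepancies


# Hypothetical annotations from two experts.
expert_a = {"a1": ["code1", "code2"], "a2": ["code3"]}
expert_b = {"a1": ["code1", "code4"], "a2": ["code3"]}
to_review = find_discrepancies(expert_a, expert_b)
print(to_review)
```

Articles that appear in the result would be the ones sent to the consensus phase.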

The data of each test set will be served as JSON strings.

The format of the development set data in the JSON string will be the same as that of the training set.

The format of the test set data in the JSON string will be the following:

{
  "articles": [
    {
      "id": "Id of the article",
      "title": "Title of the article",
      "abstractText": "Content of the abstract",
      "journal": "Name of the journal",
      "year": 2018,
      "db": "Name of the database"
    }
  ]
}

This JSON has the key articles, which contains an array of document objects. Each object has the keys id, title, abstractText, journal, year and db.
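Before processing a test set, it can be useful to check that every document object carries the keys listed above. The following is a small sketch using only the standard library; the function name and the sample string are hypothetical.

```python
import json

# Keys that each document object in the test set is described as having.
EXPECTED_KEYS = {"id", "title", "abstractText", "journal", "year", "db"}


def check_test_set(json_string):
    """Return a list of (position, missing keys) for documents lacking expected keys."""
    data = json.loads(json_string)
    problems = []
    for position, article in enumerate(data["articles"]):
        missing = EXPECTED_KEYS - article.keys()
        if missing:
            problems.append((position, sorted(missing)))
    return problems


# Invented sample: the second document lacks the "db" key.
sample_string = json.dumps({"articles": [
    {"id": "a1", "title": "T1", "abstractText": "Resumen", "journal": "J1",
     "year": 2018, "db": "LILACS"},
    {"id": "a2", "title": "T2", "abstractText": "Resumen", "journal": "J2",
     "year": 2018},
]})
problems = check_test_set(sample_string)
print(problems)
```

An empty list means every document has all six keys.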

The Development and Test sets will be available for download on the date announced in the Schedule.