About VHL, LILACS and IBECS

The Virtual Health Library (VHL) is a decentralized and dynamic collection of information sources designed to provide access to scientific knowledge on health. It is maintained in three languages (English, Portuguese and Spanish) by BIREME, a Specialized Center of the Pan American Health Organization (PAHO).

The VHL is a Network of Networks, built collectively and coordinated by BIREME. It is developed, as a matter of principle, in a decentralized manner by national instances (Argentina VHL, Brazil VHL, etc.) and by thematic networks of institutions related to research, education and health services (Nursing VHL, VHL Ministry of Health, etc.).

To learn more about the VHL Model, please visit http://red.bvsalud.org/en/

The information sources are selected according to criteria approved by the VHL Network. The index is updated weekly through harvesting of metadata from the collection of information sources.

LILACS is the most important and comprehensive index of scientific and technical literature of Latin America and the Caribbean. It includes 26 countries, 882 journals and 878,285 records, 464,451 of which are full texts.

IBECS (Índice Bibliográfico Español en Ciencias de la Salud) includes bibliographic references from scientific articles in health sciences published in Spanish journals.

You can access both databases through the Virtual Health Library (VHL) Portal.

LILACS and IBECS contents are indexed following the LILACS methodology. You can download the LILACS indexing guidelines (only in Spanish).

Additional references about LILACS can be found here.

Resources

Supplementary resources for training

  • PubMed abstracts translated into Spanish (to be published). We are translating all PubMed English abstracts into Spanish. This Spanish version will include the associated DeCS codes, generated from the original MeSH descriptors.

Linguistic Resources

  • AbreMES-DB: The Spanish Medical Abbreviation DataBase. Abbreviations are extracted from the metadata (titles and abstracts) of biomedical publications written in Spanish. Download it from ZENODO.
  • MEDDOCAN-Gazetteer: Gazetteer of MEDDOCAN-related entities. It includes names, surnames, addresses, hospitals, professions, and different types of locations (provinces, cities, towns, etc.). Download it from here.
  • Sentence-split test set: sentence-split version of the test set (including the background set), computed using SPACCC_POS-TAGGER (see below). These annotations are mandatory to compute the leak score of subtrack 1. Download it from here.
  • SPACCC_POS-TAGGER: Part-of-Speech tagger for Spanish medical-domain corpora, based on FreeLing. Download it from GitHub.

DeCS Resources

  • DeCS descriptors 2019 (table with DeCS codes plus the descriptors & synonyms from both the European and Latin Spanish DeCS data sets, separated by pipes).

DeCS Headings

DeCS (Health Sciences Descriptors) is a trilingual, structured vocabulary created by BIREME to serve as a unified language for indexing articles from scientific journals, books, conference proceedings, technical reports and other types of materials, as well as for searching and retrieving subjects from the scientific literature in information sources available on the Virtual Health Library (VHL), such as LILACS and MEDLINE.

It was developed from the MeSH – Medical Subject Headings of the U.S. National Library of Medicine (NLM).

DeCS is part of the LILACS Methodology and is an integrating component of the Virtual Health Library. DeCS participates in the unified terminology development project UMLS (Unified Medical Language System) of the NLM, being responsible for contributing the terms in Portuguese and Spanish.

The concepts that make up the DeCS vocabulary are organized in a tree structure, allowing searches on broader or narrower terms, or on all terms that belong to the same branch of the hierarchy.
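
As a minimal illustration of how such broader/narrower relations can be exploited, the sketch below assumes MeSH-style tree numbers (DeCS inherits its hierarchy from MeSH); the tree numbers shown are toy values, not actual DeCS data.

def is_narrower(tree_number, ancestor_tree_number):
    """A term is narrower than another if its tree number extends the other's."""
    return tree_number.startswith(ancestor_tree_number + ".")

# Example: in MeSH-style numbering, "C04.557.337" is narrower than "C04.557",
# which in turn is narrower than "C04".
# is_narrower("C04.557.337", "C04.557")  # -> True
# is_narrower("C04.557", "C04.557.337")  # -> False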

Besides the original MeSH terms, DeCS also includes terminology in specific areas such as Public Health, Homeopathy, Science and Health, and Health Surveillance.

The training data set lists the DeCS codes assigned to each record in the source data. Since the original XML data contain descriptors (not codes), we provide a DeCS conversion table with:

  • DeCS code
  • Preferred descriptor (the label used in the European DeCS 2019 set)
  • List of synonyms (the descriptors and synonyms from both the European and Latin Spanish DeCS 2019 data sets, separated by pipes)

For more information check BioASQ Spanish Track.

Datasets

The Training Dataset

The training dataset contains all records from the LILACS and IBECS databases at the Virtual Health Library (VHL) that have a non-empty abstract written in Spanish. The URL used to retrieve the records is as follows:

http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&

To learn more about VHL, LILACS and IBECS, see the About VHL, LILACS and IBECS section.

To filter out records with empty abstracts, we only select those whose XML element ab_es is different from ‘null’ or ‘No disponible’ (‘Not available’ in Spanish). You can see an example XML record from here. To ensure that all abstracts in the data source are really written in Spanish, we used a language guesser (langdetect) to verify this condition. This check identified non-Spanish abstracts in the ab_es XML element, and the corresponding 1,797 records (mostly text in English and Portuguese) were removed from the data set.
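
The sketch below illustrates this filtering step. It assumes that each record has already been parsed from the XML export into a Python dict whose ab_es key holds the abstract text; the dict representation and variable names are illustrative and not part of the distributed data.

from langdetect import detect

def keep_record(record):
    """Return True if the record has a usable Spanish abstract."""
    abstract = record.get("ab_es")
    # Discard empty or placeholder abstracts.
    if abstract in (None, "", "null", "No disponible"):
        return False
    # Discard abstracts that the language guesser does not identify as Spanish.
    try:
        return detect(abstract) == "es"
    except Exception:
        # langdetect raises when it cannot extract any language features.
        return False

# Hypothetical usage on a list of parsed records:
# spanish_records = [r for r in records if keep_record(r)]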

The training dataset was crawled on 10/22/2019, so the data is a snapshot of that moment and may change over time. In fact, it is very likely that the data will undergo minor changes, as the different databases that make up LILACS and IBECS may add or modify their indexes.

The resulting data sets contain 369,368 records from 26,609 different journals. Two different data sets are distributed, as described below:

  • Original Train set with 369,368 records that also include the qualifiers, as retrieved from VHL. Download the Original Train set from here.
  • Pre-processed Train set with the 318,658 records that have at least one DeCS code and no qualifiers. Download the Pre-processed Train set from here.

The table below describes the length of the abstracts in the original set (measured in number of characters):

Abstracts’ length (measured in characters)
Min: 12
Avg: 1140.41
Median: 1094
Max: 9428

The DeCS codes are distributed as follows (for more information about the distribution of DeCS codes in the training set, check here):

Number of DeCS codes per file
Min: 1
Avg: 8.12
Median: 7
Max: 53

The training data sets are distributed as JSON files with the following format:

{
  "articles": [
    {
      "id": "Id of the article",
      "title": "Title of the article",
      "abstractText": "Content of the abstract",
      "journal": "Name of the journal",
      "year": 2018,
      "db": "Name of the database",
      "decsCodes": [
        "code1",
        "code2",
        "code3"
      ]
    }
  ]
}
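
As a quick way to get started, the following sketch loads one of the training files and reproduces the kind of statistics reported above; the file name is a placeholder for whichever training file you downloaded.

import json
import statistics

# The file name below is a placeholder for the downloaded training file.
with open("mesinesp_training_set.json", encoding="utf-8") as fh:
    articles = json.load(fh)["articles"]

print(len(articles), "articles loaded")

# Number of DeCS codes per record (>= 1 in the pre-processed set).
codes_per_record = [len(a["decsCodes"]) for a in articles]
print("codes per record - min:", min(codes_per_record),
      "median:", statistics.median(codes_per_record),
      "max:", max(codes_per_record))

# Abstract length in characters, as in the statistics above.
abstract_lengths = [len(a["abstractText"]) for a in articles]
print("average abstract length:", round(statistics.mean(abstract_lengths), 2))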

Note that the decsCodes field lists the DeCS codes assigned to a record in the source data. Since the original XML data contain descriptors (not codes), we provide a DeCS conversion table with:

  • DeCS code
  • Preferred descriptor (the label used in the European DeCS 2019 set)
  • List of synonyms (the descriptors and synonyms from both the European and Latin Spanish DeCS 2019 data sets, separated by pipes)

For more details on the Latin and European Spanish DeCS codes, see http://decs.bvs.br and http://decses.bvsalud.org/, respectively.
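
The sketch below shows one way to load such a conversion table and map descriptors back to DeCS codes. The file name and the exact column layout (tab-separated: code, preferred descriptor, pipe-separated synonyms) are assumptions based on the description above; adjust them to the distributed file.

import csv

def load_decs_table(path):
    """Build a descriptor -> DeCS code lookup from the conversion table."""
    descriptor_to_code = {}
    with open(path, encoding="utf-8", newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            code, preferred, synonyms = row[0], row[1], row[2]
            # Map the preferred descriptor and every synonym to the DeCS code.
            descriptor_to_code[preferred.strip().lower()] = code
            for synonym in synonyms.split("|"):
                if synonym.strip():
                    descriptor_to_code[synonym.strip().lower()] = code
    return descriptor_to_code

# Hypothetical file name; replace with the distributed conversion table.
# decs_lookup = load_decs_table("decs_conversion_table.tsv")
# decs_lookup.get("neoplasias")  # -> the corresponding DeCS code, if present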

Development and Test Data sets

The MESINESP task will provide Development and Test data sets, generated as follows:

  • 1,000 abstracts will be manually annotated by two human experts.
  • A cross validation process will identify discrepant annotations.
  • In case of discrepancy, a consensus phase will begin in which the two annotators must agree on their annotations.
  • In the event that LILACS/IBECS publish new indexations during the testing phase, these will be added to the consensus process. In this case, the consensus should take into account the annotations of the two human experts plus the new annotations coming from LILACS/IBECS.
  • Participants will have 500 manually annotated abstracts as a Development set or dry run.
  • The remaining 500 will comprise the Test set that will be used for the evaluation of the task.

The data of each test set will be served as JSON strings.

The format of the development set data in the JSON string will be the same as that of the training set.

The format of the test set data in the JSON string will be the following:

{
  "articles": [
    {
      "id": "Id of the article",
      "title": "Title of the article",
      "abstractText": "Content of the abstract",
      "journal": "Name of the journal",
      "year": 2018,
      "db": "Name of the database"
    }
  ]
}

This JSON has the key articles, which contains an array of document objects. Each object has the keys id, title, abstractText, journal, year and db.

The Development and Test sets will be available for download on the date announced in the Schedule.

Annotation Guidelines & Consistency

LILACS and IBECS, the two databases from which the records are drawn, are described above in the About VHL, LILACS and IBECS section; both can be accessed through the Virtual Health Library (VHL) Portal.

LILACS and IBECS contents are indexed following the LILACS methodology. You can download the LILACS indexing guidelines (only in Spanish).

Additional references about LILACS can be found here.

Consistency

We used repeated articles in the VHL to compare their indexations and analyze annotation consistency, as follows:

  • We extracted all articles with identical title and abstract.
  • We manually excluded those that, despite having the same title and abstract, were unlikely to be the same article. This included articles with very short abstracts, and articles of a certain type (series) and/or with overly generic titles and abstracts.

This gave us 763 articles repeated at least twice, grouped into 374 repetition groups.
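
The following sketch shows how such repetition groups can be rebuilt from the training JSON, assuming the record format described in the Datasets section; the manual exclusion of short or overly generic abstracts is not reproduced here.

from collections import defaultdict

def repetition_groups(articles):
    """Group articles whose title and abstract are strictly identical."""
    groups = defaultdict(list)
    for article in articles:
        key = (article["title"], article["abstractText"])
        groups[key].append(article)
    # Keep only the groups with at least two occurrences.
    return [group for group in groups.values() if len(group) >= 2]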

For each repeated article, we compared the indexations. For example, the following 5 articles have identical titles and abstracts, so we ran an all-against-all comparison, as listed in the table below:

https://pesquisa.bvsalud.org/portal/resource/pt/ibc-143700?lang=es
https://pesquisa.bvsalud.org/portal/resource/pt/ibc-138461?lang=es
https://pesquisa.bvsalud.org/portal/resource/pt/ibc-140304?lang=es
https://pesquisa.bvsalud.org/portal/resource/pt/ibc-133858?lang=es
https://pesquisa.bvsalud.org/portal/resource/pt/ibc-134090?lang=es

           143700   138461   140304   133858   134090
143700        -     38.9 %   27.8 %   31.2 %   33.3 %
138461                 -     29.4 %   53.8 %   46.2 %
140304                          -     20.0 %   21.4 %
133858                                   -     66.7 %
134090                                            -

The formula used is: Consistency = (Matched labels / Used labels) * 100, where Matched labels is the number of labels used by both A and B, and Used labels is the total number of unique labels used by A and B.
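
A minimal implementation of this measure, assuming each indexation is given as a set of DeCS codes (the code sets in the usage example are illustrative):

def consistency(labels_a, labels_b):
    """Consistency = (matched labels / unique labels used) * 100."""
    matched = len(set(labels_a) & set(labels_b))
    used = len(set(labels_a) | set(labels_b))
    return 100.0 * matched / used if used else 0.0

# Example with two illustrative code sets:
# consistency({"1", "2", "3"}, {"2", "3", "4"})  # -> 50.0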

The average, maximum and minimum consistency over the whole ‘repeated data’ set are as follows:

AVG: 43.29409449 %
MAX: 94.1 %
MIN: 5.6 %