MESINESP: Medical Semantic Indexing in Spanish

LATEST NEWS: For organizational reasons, the MESINESP task is delayed to winter. The new and definitive Training data set is published on the Datasets page, and a Development set will be provided in January (TBA). Check the Schedule.

Check the Datasets and the Evaluation pages for more information and details on changes regarding the data sets and the evaluation procedure.

The BioASQ MESINESP Task is sponsored by the Secretaría de Estado para el Avance Digital (SEAD) and the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Motivation

There is a pressing need to improve access to the information contained in health and biomedicine related documents, not only for professional medical users but also for researchers, public healthcare decision makers and, in particular, patients.

Among the variety of health related data (e.g. patents, blogs and social media, electronic health records, clinical trial repositories or drug labels), the medical literature plays a central role in evidence-based medicine, in preparing systematic reviews and in finding particular clinical case studies. Improved search engines, for instance those using query expansion approaches, typically rely on previous manual indexing of records with structured vocabularies. Manual indexing is currently highly time-consuming, costly and laborious.

Moreover, semantic indexing technologies with structured vocabularies can also result in improved search engines for other documents or improve clinical coding of electronic health records.

There is an increasing demand for better retrieval strategies for medical literature beyond English, in order to recover important medical findings, including clinical case studies, published in other languages.

The critical importance of semantic indexing with medical vocabularies has motivated several shared tasks in the past, in particular the BioASQ tracks, which attracted a considerable number of participants and had an impact on the field for medical literature in English.

Currently, most Biomedical NLP and IR research is being done on English documents, and only a few tasks have been carried out on non-English texts. Nonetheless, it is important to note that a considerable amount of medically relevant content is published in languages other than English, and clinical texts in particular are, with a few exceptions, written entirely in the native language of each country.

Spanish is spoken by more than 572 million people in the world today, either as a native, second or foreign language. It is the second language in the world by number of native speakers, with more than 477 million people. According to results derived from WHO statistics, in Spain alone there are over 180 thousand practicing physicians, more than 247 thousand nursing and midwifery personnel and 55 thousand pharmaceutical personnel.

These facts, and their extrapolation to other Spanish-speaking countries, might explain why a large amount of medical content is published in Spanish each year. Resources like PubMed contain only a fraction of the biomedical and medical literature originally published in Spanish, which is also stored in other resources such as IBECS, SCIELO or LILACS.

Following the outline of previous medical indexing efforts, in particular the success of the BioASQ tracks centered on PubMed, we propose to carry out the first task on semantic indexing of Spanish medical texts.

Example medical abstract and JSON object transformation sample for an article.

Thus, this task will address the automatic indexing of abstracts with structured medical vocabularies (DeCS terms), using two widely used databases (IBECS and LILACS) that provide content in Spanish. The main aim of MESINESP is to promote the development of semantic indexing tools of practical relevance for non-English content, determining the current state of the art, identifying challenges and comparing the strategies and results to those published for English data.
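To make the shape of the task concrete, the sketch below treats indexing as mapping one Spanish abstract to a set of vocabulary codes and serializing the result as a JSON object. It is only an illustration under stated assumptions: the vocabulary, the placeholder codes (not real DeCS identifiers), the field names "id" and "labels", and the helper functions are all hypothetical, and naive substring matching stands in for whatever indexing method a participant would actually use. The official data format is the one described in the Datasets page.

```python
import json

# Hypothetical toy vocabulary: placeholder codes mapped to Spanish term strings.
# These are NOT real DeCS identifiers, and this is not the official record layout.
TOY_VOCABULARY = {
    "CODE_DIABETES": ["diabetes mellitus", "diabetes"],
    "CODE_HIPERTENSION": ["hipertensión", "hipertension"],
    "CODE_ASMA": ["asma"],
}


def index_abstract(text, vocabulary):
    """Return the sorted list of codes whose term strings occur in the text."""
    lowered = text.lower()
    return sorted(
        code
        for code, terms in vocabulary.items()
        if any(term in lowered for term in terms)
    )


def to_json_record(article_id, text, vocabulary):
    """Serialize one prediction as a JSON object (field names are assumptions)."""
    record = {"id": article_id, "labels": index_abstract(text, vocabulary)}
    return json.dumps(record, ensure_ascii=False)


example = "Estudio sobre la prevalencia de diabetes e hipertensión en adultos."
print(to_json_record("doc-1", example, TOY_VOCABULARY))
```

A dictionary lookup of this kind would miss synonyms, inflected forms and concepts that are implied rather than named, which is part of why the task calls for dedicated semantic indexing systems.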