Biomedical Information Overload
Since the beginning of the 21st century, we have been immersed in a digitalization process that, together with the advent of the information age, has made it easier to generate, disseminate, and access digital content using software tools.
This generation and accumulation of information has been particularly significant in health and biomedicine, where accessing relevant information on specific topics has become difficult not only for practitioners but also for researchers, public healthcare decision-makers, and other health professionals. Indexing this content is essential to ensure access to relevant information.
Among the variety of health-related data (e.g. blogs and social media, drug labels, or electronic health records), some types are especially useful for clinical medicine and industry:
- Medical literature plays a central role in evidence-based medicine, in preparing systematic reviews, and in finding particular clinical case studies. Physicians consult bibliographic sources to keep up with the latest progress in medical treatments and clinical trials, especially for new diseases such as COVID-19.
- Health patents are an essential tool in the competitive intelligence strategies of healthcare companies. Industry stakeholders use them to make decisions on future R&D investment and on entering new markets, and to accelerate the return on investment of new company assets.
Medical Semantic Indexing
Content indexing is essential to ensure access to relevant information. In recent years, the Information Retrieval systems behind search engines have been improved through query expansion approaches, which often rely on records having been manually indexed beforehand with structured vocabularies, enabling more powerful document search. However, manual indexing is highly time-consuming, expensive, and laborious.
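To make the idea of vocabulary-based query expansion concrete, the following is a minimal sketch and not the method of any particular search engine. The `VOCABULARY` table, its entries, and the function names are assumptions made for illustration; the terms are not taken from DeCS or MeSH.

```python
# Hypothetical mini-vocabulary: preferred term -> synonyms or variants.
# These entries are illustrative only and are NOT real DeCS/MeSH records.
VOCABULARY = {
    "neoplasias": ["cáncer", "tumor", "tumores"],
    "hipertensión": ["tensión alta"],
}


def build_synonym_index(vocabulary):
    """Map every preferred term and synonym (lower-cased) to its preferred term."""
    index = {}
    for preferred, synonyms in vocabulary.items():
        index[preferred.lower()] = preferred
        for synonym in synonyms:
            index[synonym.lower()] = preferred
    return index


def expand_query(query, vocabulary):
    """Return the query tokens plus every vocabulary variant of each matched token.

    Only single-token matches are handled; multi-word terms would need
    phrase matching, which this sketch leaves out.
    """
    index = build_synonym_index(vocabulary)
    expanded = []
    for token in query.lower().split():
        candidates = [token]
        preferred = index.get(token)
        if preferred is not None:
            candidates += [preferred] + vocabulary[preferred]
        for candidate in candidates:
            if candidate not in expanded:  # keep first occurrence, preserve order
                expanded.append(candidate)
    return expanded


print(expand_query("tratamiento cáncer", VOCABULARY))
```

A query that mentions "cáncer" would then also retrieve records indexed under the preferred term or any of its listed variants, which is why the quality of the underlying indexing matters so much.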
Semantic indexing with medical vocabularies has proved to be a good way to reduce the costs and time bottlenecks of document indexing. The importance of these technologies has motivated several shared tasks in the past, in particular the BioASQ tracks, which attracted a considerable number of participants and had an impact on the field for medical literature in English.
Most Biomedical NLP and IR research is done on English documents, but a considerable amount of medically relevant content is published in languages other than English. This is especially true of clinical texts, which, with a few exceptions, are written entirely in the native language of each country.
Importance of Spanish in Healthcare Documents
Spanish is spoken by more than 572 million people in the world today, as a native, second, or foreign language. It is the second language in the world by number of native speakers, with more than 477 million. According to results derived from WHO statistics, in Spain alone there are over 180 thousand practicing physicians, more than 247 thousand nursing and midwifery personnel, and 55 thousand pharmaceutical personnel.
These figures, extrapolated to other Spanish-speaking countries, might explain why a large amount of medical content is published in Spanish each year. For medical literature, resources like PubMed contain only a fraction of the biomedical and medical literature originally published in Spanish, which is also stored in other resources such as IBECS, SciELO, or LILACS. In addition, most clinical trials conducted by Spanish speakers are not published in PubMed but in specific databases such as the Registro Español de Estudios Clínicos (REec), where clinical trials authorized by the Spanish Drug Agency are published. Something similar happens with patents, which are usually published in Spanish first to protect the results locally, before the global patent is obtained and published in English.
Following the outline of previous medical indexing efforts, in particular the success of the BioASQ tracks centered on PubMed, we propose to carry out the second edition of the task on semantic indexing of Spanish health-related texts.
About the task
The MESINESP2 shared task invites researchers and medical and industry professionals to develop automatic semantic indexing systems with structured medical vocabularies for Spanish documents. The main aim of MESINESP is to promote the development of semantic indexing tools of practical relevance for non-English content, determining the current state of the art, identifying challenges, and comparing the strategies and results to those published for English data.
The MESINESP2 task is divided into three subtracks:
- MESINESP-L – Scientific Literature [Subtrack 1]: This track will require automatic indexing with DeCS terms of abstracts from two widely used Spanish-language databases (IBECS and LILACS).
- MESINESP-T – Clinical Trials [Subtrack 2]: This track will require automatic indexing with DeCS terms of clinical trials from REec (Registro Español de Estudios Clínicos).
- MESINESP-P – Patents [Subtrack 3]: This track will require automatic indexing with DeCS terms of the content of Spanish patents extracted from Google Patents.
For each of the subtracks, we have prepared a set of three corpora, which are described in detail in the data section.
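As a rough illustration of what indexing a document with a controlled vocabulary involves, the sketch below assigns descriptor codes to a text by simple dictionary lookup. It is a naive baseline under stated assumptions, not the approach expected from participants: the `DESCRIPTORS` table, its codes, and its terms are hypothetical and are not real DeCS entries, and the function names are chosen only for this example.

```python
import re
import unicodedata

# Hypothetical descriptor table: code -> surface forms.
# Codes and terms are made up for illustration; they are NOT real DeCS entries.
DESCRIPTORS = {
    "X001": ["diabetes mellitus", "diabetes"],
    "X002": ["insuficiencia cardíaca", "fallo cardíaco"],
    "X003": ["ensayo clínico", "ensayos clínicos"],
}


def normalize(text):
    """Lower-case, strip diacritics, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip()


def index_document(text, descriptors):
    """Return the sorted codes whose surface forms occur as whole words in text."""
    normalized = normalize(text)
    matched = set()
    for code, forms in descriptors.items():
        for form in forms:
            pattern = r"\b" + re.escape(normalize(form)) + r"\b"
            if re.search(pattern, normalized):
                matched.add(code)
                break  # one matching form is enough for this descriptor
    return sorted(matched)


abstract = "Ensayo clínico en pacientes con Diabetes  mellitus tipo 2."
print(index_document(abstract, DESCRIPTORS))
```

String matching of this kind misses descriptors that are implied but never mentioned literally, which is one reason the task calls for more capable automatic indexing systems.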
BIREME / PAHO / WHO collaborates with BSC in the development of the medical semantic indexing activity in Spanish by facilitating access to the bibliographic metadata of the LILACS and IBECS databases, available in the Virtual Health Library (VHL), and to the terms of the DeCS controlled vocabulary. DeCS is the translation and extension of MeSH in Spanish, Portuguese, French, and English, and is used to index the technical and scientific literature in health sciences.