Multilingual corpus

by livingner

Download the LivingNER corpus (including the multilingual corpus) from Zenodo.

We have generated the annotated (and normalized to NCBI Taxonomy) training and validation sets in 6 languages: English, Portuguese, Catalan, Italian, French, and Romanian. The process was:

  1. The text files were translated with a neural machine translation system.
  2. The annotations were translated with the same neural machine translation system.
  3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
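
The transfer step can be sketched as a simple offset-recovery search: each translated annotation string is located in the translated document to obtain its character offsets. This is an illustrative simplification, not the actual annotation-transfer technology used to build the corpus; the `transfer_annotations` helper is hypothetical.

```python
# Illustrative sketch of annotation transfer: locate each translated
# annotation string in the translated document to recover (off0, off1)
# character offsets. The real transfer technology is more sophisticated.

def transfer_annotations(text, translated_spans):
    """Return (span, off0, off1) triples for each span found in the text."""
    transferred = []
    cursor = 0  # search forward so repeated spans map to successive offsets
    for span in translated_spans:
        off0 = text.find(span, cursor)
        if off0 == -1:
            off0 = text.find(span)  # fall back to searching from the start
        if off0 == -1:
            continue  # span not recoverable; would need manual review
        off1 = off0 + len(span)
        transferred.append((span, off0, off1))
        cursor = off1
    return transferred
```

For instance, `transfer_annotations("The patient owns two dogs.", ["patient", "dogs"])` returns `[("patient", 4, 11), ("dogs", 21, 25)]`.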

If you want to visualize the multilingual resources, check out this Brat server: https://temu.bsc.es/mLivingNER/#/translations/
For instance, you can see the parallel annotations in English vs in French, or in Spanish (the gold standard) vs in Italian.

LivingNER Multilingual corpus overview

LivingNER Annotation Guidelines

Download the annotation guidelines from Zenodo.

The LivingNER corpus was manually annotated by clinical experts following annotation guidelines specifically created for this task. These guidelines contain rules for annotating species and infectious diseases in Spanish clinical cases, although infectious diseases are not part of this task. They also include some considerations regarding the codification of the annotations to the NCBI Taxonomy.

Guidelines were created de novo in three phases:

  1. First, a zero version of the guidelines was developed after annotating an initial batch of ~40 clinical cases and outlining the main problems and difficulties of the data.
  2. Second, a stable version of the guidelines was reached by annotating sample sets of the LivingNER corpus iteratively until quality control was satisfactory.
  3. Third, the guidelines were iteratively refined as manual annotation continued.


LivingNER corpus post-annotation review steps:

  • Consistency review: all annotated strings were searched for across all documents, and a clinical expert reviewed whether each occurrence should also be annotated.
  • False positives: all annotations for the entities “neonatología”, “personalidad”, “cocaína”, “sociofamiliares”, “politraumatizado” were eliminated.
  • False negatives: occurrences in the text of “prion”, “contacto sexual”, “oportunistas”, “enfermedades oportunistas”, “fascitis necrosante”, “fascitis necrotizante”, and “probióticos” were reviewed, and annotations were added where appropriate, since these mentions are particularly relevant.
  • Label consistency: some mentions are annotated sometimes as SPECIES and sometimes as ENFERMEDAD (disease), depending on context; these were reviewed.
  • Validation of standardization: all codes were checked to ensure that they were in the official version of NCBI Taxonomy.
  • Consistency of standardization: there are some mentions that have different codes depending on the context. These were reviewed.
  • Review of unmapped entities: we reviewed all mentions without codes.
  • Checking internal line breaks: annotations containing line-break characters were removed, since annotations must not span more than one line.
  • Annotation starting and ending characters: all annotations were checked to ensure that they start and end with an alphanumeric character or a parenthesis. For example, “,adenocarcinoma,” would be an erroneous annotation since it is surrounded by commas.
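
The last check can be expressed as a short script. The regular expression below is our assumption about what counts as a valid edge character (word characters and parentheses); it is not the organizers' actual tooling.

```python
import re

# An annotation is considered well-bounded if it starts and ends with an
# alphanumeric character (or underscore) or a parenthesis. Single-character
# spans are handled by the second alternative.
VALID_EDGES = re.compile(r"^[\w(].*[\w)]$|^[\w()]$", re.UNICODE)

def has_valid_boundaries(span):
    return bool(VALID_EDGES.match(span))
```

For example, `has_valid_boundaries("adenocarcinoma")` is `True`, while `has_valid_boundaries(",adenocarcinoma,")` is `False`.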

FAQ

Email your questions to Antonio Miranda (antoniomiresc@gmail.com).

  1. Q: What is the goal of the shared task?
    The goal is to predict the annotations (or codes) of the documents in the test and background sets. The goal of subtask 3 (CLINICAL IMPACT) is to predict the document categories and the evidence for those categories.
  2. Q: How do I register?
    Here: https://temu.bsc.es/livingner/2022/01/28/registration/
  3. Q: How to submit the results?
    We will provide further information in the following days.
    Download the example ZIP file.
    See Submission page for more info.
  4. Q: Can I use additional training data to improve model performance?
    Yes, participants may use any additional training data available to them, as long as they describe it in the working notes. We will ask you to summarize such resources in your participant paper.
  5. Q: The task consists of three sub-tasks. Do I need to complete all of them? In other words, is it allowed to complete only one or two sub-tasks?
    Sub-tasks are independent, and participants may take part in one, two, or all three of them.
  6. Q: How can I submit my results? Can I submit several prediction files for each sub-task?
    You will have to create a ZIP file with your prediction files and submit it to EasyChair (further details will be released soon).
    Yes, you can submit up to 5 prediction files, all in the same ZIP.
    Download the example ZIP file.
    See Submission page for more info.
  7. Q: Should prediction files have headings?
    No, prediction files should have no headings.
  8. Q: Are all codes and mentions equally weighted?
    Yes.
  9. Q: LivingNER-NORM and CLINICAL Impact. What version of the NCBI Taxonomy is in use?
    We are using the latest version available in January 2021.
    There is a complete list of the valid codes on Zenodo. Codes not present in this list will not be used for the evaluation.
  10. Q: LivingNER-NORM and CLINICAL Impact. What is meant by the “H” appended to various codes?
    Some SPECIES mentions are more specific than the most granular term available in the NCBI Taxonomy. In that case, we append “H” to the code, separated by a “|”.
    For example, “K pneumoniae BLEE” is not specified in the NCBI Taxonomy, but “Klebsiella pneumoniae” is (code 573), so we assign 573|H.
  11. Q: LivingNER-NORM and CLINICAL Impact. What do the codes separated by a “|” mean?
    Some SPECIES mentions are only correctly described by a combination of NCBI Taxonomy codes. For instance, “virus B y C de la hepatitis” does not exist as such in NCBI Taxonomy. However, we may express it as a combination of the NCBI Taxonomy terms “Hepatitis B virus” (10407) and “Hepacivirus C” (11103). Then, we assign 10407|11103.
  12. Q: LivingNER-NORM and CLINICAL Impact. If a predicted mention has several codes, do I need to provide them in some particular order?
    No. The evaluation library accepts combined codes in any order.
  13. Q: LivingNER-App. When one of the entities results in a NOCODE (the placeholder used when the NCBI Taxonomy does not contain any suitable code), should this NOCODE marker appear in the final output for subtask 3, or should it be omitted entirely?
    As a (totally made-up) example:

    (1) caso_clinico_neurologia78 Yes NOCODE No NA Yes 9031+NOCODE No NA <– adding the NOCODE as if it were any other code
    or
    (2) caso_clinico_neurologia78 Yes NA No NA Yes 9031 No NA <– ignoring the NOCODEs and writing NA when no other evidence is available, despite the flag being “Yes”

    The correct answer is the first one. As a general rule, whenever you try to assign an NCBI code but the mention is outside the scope of the terminology, you should add NOCODE. That is what we did in the manual annotation process (https://doi.org/10.5281/zenodo.6385162).
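
Following this answer, a code-evidence field for subtask 3 can be built by joining all codes, NOCODE included, with “+”, and writing NA only when there is no code at all. The `format_evidence` helper below is illustrative, not part of the official tooling.

```python
def format_evidence(codes):
    """Join NCBI Taxonomy codes (and NOCODE markers) into one evidence field."""
    return "+".join(codes) if codes else "NA"
```

For example, `format_evidence(["9031", "NOCODE"])` produces "9031+NOCODE", matching option (1) above, and `format_evidence([])` produces "NA".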

Scientific Committee

  • Tome Eftimov, Jožef Stefan Institute, Ljubljana, Slovenia
  • Irena Spasic, School of Computer Science & Informatics, co-Director of the Data Innovation Research Institute, Cardiff University, UK
  • Kirk Roberts, School of Biomedical Informatics, University of Texas Health Science Center, USA
  • Felipe Bravo, Assistant Professor, University of Chile, Chile
  • Karin Verspoor, School of Computing and Information Systems, Health and Biomedical Informatics Centre, University of Melbourne, Australia
  • Tristan Naumann, Microsoft Research Healthcare NExT, USA
  • Claire Nédellec, University Paris-Saclay, INRAE, France
  • Enea Parimbelli, Assistant Professor at University of Pavia, Italy
  • Casimiro Pio Carrino, Research Engineer at Barcelona Supercomputing Center, Spain
  • Zhiyong Lu, Deputy Director for Literature Search, National Center for Biotechnology Information (NCBI), USA
  • Ashish Tendulkar, Google Research
  • Rosa Estopà Bagot, Universitat Pompeu Fabra, Spain
  • Carlos Luis Parra Calderón, Head of Technological Innovation, Virgen del Rocío University Hospital, Institute of Biomedicine of Seville, Spain
  • Thierry Declerck, Senior Consultant at DFKI GmbH, Germany
  • Ernestina Menasalvas, Universidad Politécnica de Madrid, Spain
  • Koldo Gojenola, University of the Basque Country, Spain
  • Aurélie Névéol, LIMSI-CNRS, Université Paris-Sud, France
  • Giorgio Maria Di Nunzio, University of Padua, Italy
  • Anália Lourenço, Universidade de Vigo, Spain
  • Frank Emmert-Streib, Tampere University, Finland
  • Pablo Serrano, Planning Director at Hospital 12 de Octubre, Spain
  • Yoan Gutiérrez, University of Alicante, Spain
  • Vasile Păiș, Research Institute for Artificial Intelligence “Mihai Drăgănescu”, Romanian Academy
  • Enrique Carrillo de Santa Pau, IMDEA Food Institute, Spain

Task Organizers

  • Antonio Miranda-Escalada, Text Mining Unit, Barcelona Supercomputing Center, Spain
  • Eulàlia Farré, Text Mining Unit, Barcelona Supercomputing Center, Spain
  • Salvador López Lima, Text Mining Unit, Barcelona Supercomputing Center, Spain
  • Martin Krallinger, Text Mining Unit, Barcelona Supercomputing Center, Spain

Workshop

LivingNER will be part of the IberLEF (Iberian Languages Evaluation Forum) 2022 evaluation campaign at SEPLN 2022 (38th Annual Congress), which takes place in September in A Coruña, Spain.

IberLEF aims to encourage the research community to define new challenges and obtain cutting-edge results for the Natural Language Processing community, involving at least one of the Iberian languages: Spanish, Portuguese, Catalan, Basque, or Galician. Accordingly, several shared-task challenges are proposed.

LivingNER participants will have the opportunity to publish their system descriptions at the IberLEF proceedings.

Also, selected LivingNER participants will present their system descriptions at the IberLEF 2022 workshop (September 2022).

Have a look at the 2020 Cantemist presentations (here) and the 2021 Meddoprof presentations (here).

Resources

LivingNER Resources

LivingNER terminology – codes Reference List

  • Reference list with all valid codes from NCBI Taxonomy, with the terms translated into Spanish. It is a .tsv file, available on Zenodo.

LivingNER evaluation script

Other Resources

Word embeddings

  • Spanish Medical Word Embeddings. Word embeddings generated from Spanish medical corpora. Download them from Zenodo.
    They can be used as building blocks for clinical NLP systems processing Spanish texts.

Linguistic Resources

  • CUTEXT. See it on GitHub.
    Medical term extraction tool.
    It can be used to extract relevant medical terms from clinical cases.
  • SPACCC POS Tagger. See it on Zenodo.
    Part Of Speech Tagger for Spanish medical domain corpus.
    It can be used as a component of your system.
  • NegEx-MES. See it on Zenodo.
    A system for negation detection in Spanish clinical texts based on NegEx algorithm.
    It can be used as a component of your system.
  • Negation corpus. See it on GitHub.
    A Corpus of Negation and Uncertainty in Spanish Clinical Texts (and instructions to train the system).
  • AbreMES-X. See it on Zenodo.
    Software used to generate the Spanish Medical Abbreviation DataBase.
  • AbreMES-DB. See it on Zenodo.
    Spanish Medical Abbreviation DataBase.
    It can be used to fine-tune your system.
  • MeSpEn Glossaries. See it on Zenodo.
    Repository of bilingual medical glossaries made by professional translators.
    It can be used to fine-tune your system.

LivingNER corpus description

Download the corpus from Zenodo.

This page contains the following information:

  1. LivingNER corpus General Information
  2. LivingNER corpus format

1. General information

The LivingNER Gold Standard consists of a collection of 2000 clinical case reports that will be distributed in plain text in UTF-8 encoding, where each clinical case is stored as a single file. Additionally, we will also provide annotation files in TSV (tab-separated values) format, comprising the character offsets of the entity mentions together with their corresponding NCBI Taxonomy code annotations.

                        Figure 3. Example plain text LivingNER corpus document.

The clinical case reports come from 20 medical disciplines (infectious diseases (including Covid-19 cases), cardiology, neurology, oncology, otorhinolaryngology, dentistry, pediatrics, endocrinology, primary care, allergology, radiology, psychiatry, ophthalmology, psychiatry, urology, internal medicine, emergency and intensive care medicine, radiology, tropical medicine, and dermatology), with [SPECIES] and [HUMAN] entities manually annotated.

The corpus content is quite varied, as it includes annotations for animals, plants, and microorganisms (including bacteria, fungi, viruses, and parasites). Both scientific and common names were considered.

All of these mentions have been manually mapped to the NCBI Taxonomy. Please be aware that:

  • Composite mentions. If several NCBI taxonomy codes were required to map a single annotated mention, the codes are concatenated with a “|” symbol. For instance, “microorganism” is mapped to “2|2759|10239”.
  • Terminology codes that are more general than the annotated mention. If the NCBI taxonomy concept was more general than the annotated mention, the modifier “H” is added to the NCBI taxonomy code. For instance, “baciloscopia” is mapped to “2|H”.
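
Under these two conventions, a NCBITax field can be parsed by splitting on “|” and treating an “H” component as the generality modifier rather than a code. A minimal, illustrative parser (the helper name is ours, not part of the released tooling):

```python
def parse_ncbitax(code_field):
    """Split a NCBITax field into its taxonomy codes and the H modifier."""
    parts = code_field.split("|")
    is_h = "H" in parts
    codes = [p for p in parts if p != "H"]
    return codes, is_h
```

For example, `parse_ncbitax("2|2759|10239")` yields `(["2", "2759", "10239"], False)`, while `parse_ncbitax("2|H")` yields `(["2"], True)`.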

The final corpus will be randomly split into three subsets: training, development, and test. For the training and development sets, in addition to the clinical cases, a TSV file will be released containing one row per annotation.

In addition to the test set, a larger background set of clinical case documents will be released to make sure that participating teams will not be able to do manual corrections.

The goal of the LivingNER task is to develop automatic systems for Spanish medical texts. These systems should rely on the LivingNER corpus, a high-quality Gold Standard clinical corpus of 2000 records built through a manual annotation process performed by human experts, together with an inter-annotator agreement consistency analysis.

2. Corpus format

For subtask 1 (LivingNER – Species NER), annotations are distributed in a tab-separated file (TSV) file with the following columns:

  • filename: document name
  • mark: mention identifier
  • label: mention type (SPECIES or HUMAN)
  • off0: starting position of the mention in the document
  • off1: ending position of the mention in the document
  • span: textual span
 Figure 1. Example annotation for LivingNER-Species NER Track.
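
A minimal way to load an annotation TSV with these columns in Python, assuming the released files carry a header row naming the columns (prediction files, per the FAQ, have no headings, so the reader would need adjusting for those):

```python
import csv
import io

def read_species_annotations(tsv_text):
    """Parse subtask 1 annotation rows, converting offsets to integers."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["off0"] = int(row["off0"])
        row["off1"] = int(row["off1"])
        rows.append(row)
    return rows
```

Each returned row is a dict keyed by the column names, with off0/off1 ready for slicing the document text.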

For subtask 2 (LivingNER – Species Norm), annotations are distributed in a TSV file with the same columns as the previous one, plus:

  • isH: whether the span is more specific (narrower) than the assigned NCBITax code
  • isN: whether the mention corresponds to a nosocomial infection
  • iscomplex: whether the span has assigned a combination of NCBITax codes
  • NCBITax: mention code in the NCBI Taxonomy
Figure 2. Example annotation for LivingNER-Species Norm Track.

For subtask 3 (LivingNER – Clinical IMPACT), annotations are distributed in a TSV file with the following columns:

  • filename
  • isPet (Yes/No)
  • PetIDs (NCBITaxonomy codes of pet & farm animals present in document)
  • isAnimalInjury (Yes/No)
  • AnimalInjuryIDs (NCBITaxonomy codes of animals causing injuries present in document)
  • IsFood (Yes/No)
  • FoodIDs (NCBITaxonomy codes of food mentions present in document)
  • isNosocomial (Yes/No)
  • NosocomialIDs (NCBITaxonomy codes of nosocomial species mentions present in document)
Figure 3. Example annotation for LivingNER-Clinical Impact Track.

All text files are distributed as plain UTF-8 text files, where each clinical case is stored as a single file.