Additional Resources

by socialdisner

Evaluation Script

  • Official evaluation script: TBD

Linguistic Resources

  • CUTEXT. See it on GitHub.
    Medical term extraction tool.
    It can be used to extract relevant medical terms from tweets.
  • SPACCC POS Tagger. See it on GitHub.
    Part Of Speech Tagger for Spanish medical domain corpus.
    It can be used as a component of your system.
  • NegEx-MES. See it on Zenodo and on GitHub.
    A system for negation detection in Spanish clinical texts based on NegEx algorithm.
    It can be used as a component of your system.
  • AbreMES-X. See it on Zenodo.
    Software used to generate the Spanish Medical Abbreviation DataBase.
  • AbreMES-DB. See it on Zenodo.
    Spanish Medical Abbreviation DataBase.
    It can be used to fine-tune your system.
  • MeSpEn Glossaries. See it on Zenodo.
    Repository of bilingual medical glossaries made by professional translators.
    It can be used to fine-tune your system.
  • Occupations gazetteer. See it on Zenodo.
    A gazetter of occupations extracted from a set of terminologies (DeCS, ESCO, SnomedCT and WordNet) and Stanford CoreNLP.

Word embeddings

  • FastText Spanish medical embeddings. See them on Zenodo.
    Word and subword embeddings trained for medical Spanish domain.
    It can be used as a component of your system.
  • FastText Spanish Twitter embeddings. See them on Zenodo.
    Word and subword embeddings trained with Spanish Twitter data related to COVID-19.
    It can be used as a component of your system.

Baseline

See it on GitHub

More resources TBD

Evaluation

by socialdisner

The evaluation will be done at CodaLab. SocialDISNER submissions will be ranked by Precision, Recall, and F1-score for each ENFERMEDAD [disease] mention extracted, where the spans overlap entirely (F1-score is the primary metric). A correct prediction must have the same beginning and ending offsets as the Gold Standard annotation

Evaluation dates

July 11, 2022, midnight –  July 15, 2022, UTC

Submission format

Participating teams will have to generate a *.tsv file with the annotations detected for each of the test documents according to the following column structure:

  • tweets_id: Name of the file from which the disease mention has been extracted.
  • begin: Character number where the detected mentions start.
  • end: Character number where the detected mention ends.
  • type: In this case it will always be ENFERMEDAD
  • extraction: Mention extracted from text
tweets_id	begin	end	type	extraction
1357198223706894339	12	19	ENFERMEDAD	alergia
1357198223706894339	21	26	ENFERMEDAD	covid

CodaLab

Predictions for each subtask should be contained in a single .tsv (tab-separated values) file. This file (and only this file) should be compressed into a .zip file. Please upload this zip file as a submission. For the evaluation phase which will start on the 11th of July, you are allowed to add the validation set to the training set for training purposes.

Codalab: https://codalab.lisn.upsaclay.fr/competitions/3531

  1. Register and wait for approval
  2. To make submissions : Participate -> Submit/View Results -> Click on Task -> Click Submit -> Select File
    Refresh your submission. It goes from Submitted -> Running -> Finished. Scores should be available in the files. You can choose to submit your best scores to the Leaderboard.
  3. To view results : Results -> Click on Task -> View results in table

You will be allowed to make unlimited submissions during the validation stage. During the evaluation stage only 2 submissions will be allowed.

A step-by-step tutorial of the previous SMM4H 2021 ProfNER tack (check sub-track B) format is found here!

Registration

by socialdisner

To participate in the task, please be sure to complete the following steps:

  1. Register on Google Form. Please, choose a team name you remember since we will use it throughout the whole competition and select the Task 10 option:
  2. Login/Register in our CodaLab task in order to upload your predictions:

Important note: Student registrants are required to provide the name and email address of a faculty team member who has agreed to serve as their advisor/mentor for developing their system and writing their system description (see below). By registering for a task, participants agree to run their system on the test data and upload at least one set of predictions to CodaLab. Teams may upload up to three sets of predictions per task. By receiving access to the annotated tweets, participants agree to Twitter’s Terms of Service and may not redistribute any portion of the data.

Annotation Guidelines

by socialdisner

The SocialDisNER corpus of the SMM4H 2022 track was manually annotated by medical experts following the SMM4H-SocialDisNER guidelines. These guidelines were adapted from previous versions used to annotate EHRs and medical literature (clinical case reports) and contain rules for annotating mentions of diseases in health-related tweets in Spanish. Additionally, they also include some considerations regarding the codification of the annotations to SNOMED-CT concept codes.

Guidelines were created de novo in three phases:

  1. First, an initial version of the guidelines was adapted from clinical annotation guidelines after annotating an initial batch of ~500 tweets and outlining the main problems and difficulties of the social media data.
  2. Second, a stable version of guidelines was reached while annotating sample sets of the SocialDisNER corpus iteratively until quality control was satisfactory.
  3. Third, guidelines are iteratively refined as manual annotation continues.

The annotation guidelines are available at Zenodo.