- READ-BioMED: https://github.com/READ-BioMed/socialdisner-2022
- PLN-CMM: https://github.com/plncmm/socialdisner
- NLP-CIC-WFU: https://github.com/ajtamayoh/NLP-CIC-WFU-Contribution-to-SocialDisNER-shared-task-2022
- RACAI: https://github.com/racai-ai/RNER
- SINAI: https://huggingface.co/chizhikchi/Spanish_disease_finder
- ITAINNOVA: https://github.com/ITAINNOVA/SocialDisNER
Schedule
Status | Event | Date (UTC) | Link |
---|---|---|---|
✓ | Training set release | Mar 31 | Zenodo |
✓ | Development set release | Jun 14 | Zenodo |
✓ | Additional set 1 - 85k tweets w/ disease mentions (Silver Standard) | Jun 27 | Zenodo |
✓ | Validation predictions due [Practice Phase] [Required] | Jul 4 | - |
✓ | Additional set 2 - 85k tweets w/ additional mentions (Silver Standard) | Jul 6 | Zenodo |
✓ | Test set release (without annotations) | Jul 11 | Zenodo |
✓ | Test set predictions due [Evaluation Phase] | Jul 15 | - |
- | Test set evaluation scores release | Jul 25 | TBA |
- | System descriptions due | Aug 1 | TBA |
- | Acceptance notification | Aug 15 | TBA |
- | Camera ready system descriptions | Sep 1 | TBA |
- | SMM4H workshop at COLING conference | Oct 12-17 | COLING 2022 |
Scientific Committee
- Jey Han Lau, The University of Melbourne (Australia)
- Luca Maria Aiello, IT University of Copenhagen (Denmark)
- David Camacho, Universidad Politécnica de Madrid (Spain)
- Torsten Zesch, FernUniversität in Hagen (Germany)
- Eiji ARAMAKI, Nara Institute of Science and Technology (Japan)
- Rafael Valencia-Garcia, Universidad de Murcia (Spain)
- Antonio Jimeno Yepes, RMIT University (Australia)
- Carlos Gómez-Rodríguez, Universidade da Coruña (Spain)
- Paloma Martínez, Universidad Carlos III de Madrid (Spain)
- Anália Lourenço, Universidade de Vigo (Spain)
- Eugenio Martinez Cámara, Universidad de Granada (Spain)
- Gema Bello Orgaz, Universidad Politécnica de Madrid (Spain)
- Juan Antonio Lossio-Ventura, National Institutes of Health (USA)
- Héctor D. Menendez, King’s College London (UK)
- Manuel Montes y Gómez, National Institute of Astrophysics, Optics and Electronics (Mexico)
- Helena Gómez Adorno, Universidad Nacional Autónoma de México (Mexico)
- Rodrigo Agerri, IXA Group (HiTZ Centre), University of Basque Country EHU (Spain)
- Miguel A. Alonso, Universidade da Coruña (Spain)
- Ferran Pla, Universidad Politécnica de Valencia (Spain)
- Jose Alberto Benitez-Andrades, Universidad de Leon (Spain)
- More TBA
Task Organizers
SocialDisNER-ST is organized by:
- Luis Gasco, Barcelona Supercomputing Center, Spain
- Darryl Estrada, Barcelona Supercomputing Center, Spain
- Eulàlia Farré-Maduell, Barcelona Supercomputing Center, Spain
- Salvador Lima, Barcelona Supercomputing Center, Spain
- Martin Krallinger, Barcelona Supercomputing Center, Spain
SocialDisNER-ST is part of Social Media Mining for Health Applications (#SMM4H) Shared Task 2022, which is organized by:
- Graciela Gonzalez-Hernandez, University of Pennsylvania, USA
- Davy Weissenbacher, University of Pennsylvania, USA
- Arjun Magge, University of Pennsylvania, USA
- Ari Z. Klein, University of Pennsylvania, USA
- Ivan Flores, University of Pennsylvania, USA
- Karen O’Connor, University of Pennsylvania, USA
- Raul Rodriguez-Esteban, Roche Pharmaceuticals, Switzerland
- Lucia Schmidt, Roche Pharmaceuticals, Switzerland
- Juan M. Banda, Georgia State University, USA
- Abeed Sarker, Emory University, USA
- Yuting Guo, Emory University, USA
- Elena Tutubalina, Kazan Federal University, Russia
- Vera Davydova, Kazan Federal University, Russia
Description of the Corpus
Training and validation (annotated), test and background (unannotated) datasets
The SMM4H-Spanish corpus is a collection of 10,000 health-related tweets in Spanish, annotated with disease mentions by a medical expert. The annotation followed carefully designed guidelines that have proven useful for annotating both literature (clinical case reports) and EHRs. The aim of the corpus is to capture a wide variety of disease mentions from social media, enabling further characterization of health-related issues of practical importance.
The corpus data was obtained from a Twitter crawl focusing on selected accounts covering patient associations and organizations, healthcare institutions and professionals, as well as their followers, with the aim of retrieving healthcare-relevant tweets. This crawl was then filtered to keep only tweets written in Spanish, with particular (but not exclusive) emphasis on profiles located in Spain and some other Spanish-speaking countries.
The corpus was primarily annotated by medical experts in an iterative process that included the adaptation of medical document annotation guidelines specifically for this task. These guidelines will be publicly released together with the SocialDisNER corpus.
The annotation was performed using the web-based tool brat. Below is an example of what the annotated tweets look like:
All in all, 10,000 tweets were annotated. They were split into 50% training (5,000), 25% development (2,500) and 25% test (2,500), matching the dataset sizes listed below. The splits will be released according to the task schedule and made accessible on Zenodo.
FORMAT
SocialDisNER: Tweet disease mention detection. Annotations are stored in a tab-separated file with 5 columns:
tweet_id begin end type extraction
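As a rough illustration of the format above, the following minimal Python sketch reads such a file into typed records. It assumes the file starts with a header row carrying the five column names, and that `begin` and `end` are character offsets into the tweet text; neither assumption is confirmed here. The `Mention` class, the `read_mentions` helper, and the sample rows (IDs, offsets and the `ENFERMEDAD` label) are hypothetical and only serve the example.

```python
import csv
import io
from typing import NamedTuple


class Mention(NamedTuple):
    """One annotation row (hypothetical container, not part of the task kit)."""
    tweet_id: str
    begin: int
    end: int
    type: str
    extraction: str


def read_mentions(handle):
    """Parse a tab-separated annotation file that starts with a header row."""
    # QUOTE_NONE keeps quote characters inside tweets from being interpreted.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        Mention(
            row["tweet_id"],
            int(row["begin"]),
            int(row["end"]),
            row["type"],
            row["extraction"],
        )
        for row in reader
    ]


# Made-up sample rows for illustration only.
sample = (
    "tweet_id\tbegin\tend\ttype\textraction\n"
    "1001\t10\t18\tENFERMEDAD\tdiabetes\n"
    "1002\t0\t6\tENFERMEDAD\tcáncer\n"
)
mentions = read_mentions(io.StringIO(sample))
```

For a real file, `read_mentions(open(path, encoding="utf-8", newline=""))` would be the equivalent call, with `path` pointing at the downloaded annotation file.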
Datasets
Train set
The train set contains 5,000 annotated tweets. It will be published on Zenodo.
Validation set
The validation set contains 2,500 annotated tweets. It will be published on Zenodo.
Test and background sets
The test set contains 2,500 tweets. The background set contains 50K tweets. Both will be published on Zenodo.
The test and background sets will be published together. You will have to submit predictions for the whole set, but you will only be evaluated on the test set predictions.
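The effect of this rule can be sketched as follows: predictions are produced for every tweet, but only those whose tweet belongs to the test subset count towards the score. This is a minimal sketch, not the organizers' evaluation script; the `keep_test_predictions` helper, the tweet IDs and the `ENFERMEDAD` label are hypothetical.

```python
def keep_test_predictions(predictions, test_ids):
    """Return only the predicted rows whose tweet_id is in the evaluated test subset."""
    test_ids = set(test_ids)
    return [row for row in predictions if row[0] in test_ids]


# Made-up predictions as (tweet_id, begin, end, type, extraction) tuples.
predictions = [
    ("t1", 0, 5, "ENFERMEDAD", "gripe"),  # tweet from the test set
    ("b7", 3, 7, "ENFERMEDAD", "asma"),   # tweet from the background set
]

# Only the test-set row would be compared against the Gold Standard.
scored = keep_test_predictions(predictions, {"t1"})
```

Participants do not know which tweets are in the test subset at submission time, so the filtering happens on the organizers' side; the sketch only shows why background predictions do not affect the score.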
Test set with Gold Standard annotations
The Gold Standard annotations of the test set will be released after the submission deadline.
Corpora Stats.
Statistic | Training | Development |
---|---|---|
# Tweets | 5000 | 2500 |
# characters | 1253431 | 516768 |
# tokens | 211555 | 84478 |
Avg. char. /tweet | 250.69 | 206.71 |
Avg. Tok. /tweet | 42.31 | 33.79 |
# disease mentions | 15173 | 4252 |
# unique disease mentions | 4407 | 1413 |
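The per-tweet averages in the table follow directly from the raw counts. As a quick consistency check, this short Python sketch recomputes them from the tweet, character and token totals listed above; the `stats` and `averages` names are illustrative only.

```python
# Raw counts copied from the table above.
stats = {
    "Training": {"tweets": 5000, "characters": 1253431, "tokens": 211555},
    "Development": {"tweets": 2500, "characters": 516768, "tokens": 84478},
}

# (average characters per tweet, average tokens per tweet), rounded to 2 decimals.
averages = {
    name: (
        round(s["characters"] / s["tweets"], 2),
        round(s["tokens"] / s["tweets"], 2),
    )
    for name, s in stats.items()
}

for name, (chars_per_tweet, tokens_per_tweet) in averages.items():
    print(f"{name}: {chars_per_tweet} chars/tweet, {tokens_per_tweet} tokens/tweet")
```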
Publications
SocialDisNER’s overview paper:
Luis Gasco Sánchez, Darryl Estrada Zavala, Eulàlia Farré-Maduell, Salvador Lima-López, Antonio Miranda-Escalada, and Martin Krallinger. 2022. The SocialDisNER shared task on detection of disease mentions in health-relevant content from social media: methods, evaluation, guidelines and corpora. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 182–189, Gyeongju, Republic of Korea. Association for Computational Linguistics.
URL: https://aclanthology.org/2022.smm4h-1.48/
SMM4H 2022 overview paper:
Davy Weissenbacher, Juan Banda, Vera Davydova, Darryl Estrada Zavala, Luis Gasco Sánchez, Yao Ge, Yuting Guo, Ari Klein, Martin Krallinger, Mathias Leddin, Arjun Magge, Raul Rodriguez-Esteban, Abeed Sarker, Lucia Schmidt, Elena Tutubalina, and Graciela Gonzalez-Hernandez. 2022. Overview of the Seventh Social Media Mining for Health Applications (#SMM4H) Shared Tasks at COLING 2022. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 221–241, Gyeongju, Republic of Korea. Association for Computational Linguistics.
URL: https://aclanthology.org/2022.smm4h-1.54/
Participants' papers:
- Jia Fu, Sirui Li, Hui Ming Yuan, Zhucong Li, Zhen Gan, Yubo Chen, Kang Liu, Jun Zhao, and Shengping Liu. 2022. CASIA@SMM4H’22: A Uniform Health Information Mining System for Multilingual Social Media Texts. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 143–147, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Antonio Jimeno Yepes and Karin Verspoor. 2022. READ-BioMed@SocialDisNER: Adaptation of an Annotation System to Spanish Tweets. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 48–51, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Harsh Verma, Parsa Bagherzadeh, and Sabine Bergler. 2022. CLaCLab at SocialDisNER: Using Medical Gazetteers for Named-Entity Recognition of Disease Mentions in Spanish Tweets. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 55–57, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Matias Rojas, Jose Barros, Kinan Martin, Mauricio Araneda-Hernandez, and Jocelyn Dunstan. 2022. PLN CMM at SocialDisNER: Improving Detection of Disease Mentions in Tweets by Using Document-Level Features. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 52–54, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Antonio Tamayo, Alexander Gelbukh, and Diego Burgos. 2022. NLP-CIC-WFU at SocialDisNER: Disease Mention Extraction in Spanish Tweets Using Transfer Learning and Search by Propagation. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 19–22, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Miguel Ortega-Martín, Alfonso Ardoiz, Oscar Garcia, Jorge Álvarez, and Adrián Alonso. 2022. dezzai@SMM4H’22: Tasks 5 & 10 – Hybrid models everywhere. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 7–10, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Andrei-Marius Avram, Vasile Pais, and Maria Mitrofan. 2022. RACAI@SMM4H’22: Tweets Disease Mention Detection Using a Neural Lateral Inhibitory Mechanism. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 1–3, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Antoine Lain, Wonjin Yoon, Hyunjae Kim, Jaewoo Kang, and Ian Simpson. 2022. KU_ED at SocialDisNER: Extracting Disease Mentions in Tweets Written in Spanish. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 78–80, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Mariia Chizhikova, Pilar López-Úbeda, Manuel C. Díaz-Galiano, L. Alfonso Ureña-López, and M. Teresa Martín-Valdivia. 2022. SINAI@SMM4H’22: Transformers for biomedical social media text mining in Spanish. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 27–30, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Rosa Montañés-Salas, Irene López-Bosque, Luis García-Garcés, and Rafael del-Hoyo-Alonso. 2022. ITAINNOVA at SocialDisNER: A Transformers cocktail for disease identification in social media in Spanish. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 71–74, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Kendrick Cetina and Nuria García-Santa. 2022. FRE at SocialDisNER: Joint Learning of Language Models for Named Entity Recognition. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 68–70, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Aman Sinha, Cristina Garcia Holgado, Marianne Clausel, and Matthieu Constant. 2022. IAI @ SocialDisNER : Catch me if you can! Capturing complex disease mentions in tweets. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 85–89, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Akbar Karimi and Lucie Flek. 2022. CAISA@SMM4H’22: Robust Cross-Lingual Detection of Disease Mentions on Social Media with Adversarial Methods. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 168–170, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Beatrice Portelli, Simone Scaboro, Emmanuele Chersoni, Enrico Santus, and Giuseppe Serra. 2022. AILAB-Udine@SMM4H’22: Limits of Transformers and BERT Ensembles. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 130–134, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Veysel Kocaman, Cabir Celik, Damla Gurbaz, Gursev Pirge, Bunyamin Polat, Halil Saglamlar, Meryem Vildan Sarikaya, Gokhan Turer, and David Talby. 2022. John_Snow_Labs@SMM4H’22: Social Media Mining for Health (#SMM4H) with Spark NLP. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 44–47, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Workshop
SocialDisNER will be part of the Social Media Mining for Health 2022 (#SMM4H) workshop at COLING 2022 (the 29th International Conference on Computational Linguistics), which takes place in October in Gyeongju (Republic of Korea).
COLING is one of the leading conferences on natural language processing and computational linguistics and attracts participants from both top research centers and emerging countries.
SocialDisNER participants are required to write a short paper describing the system(s) they ran on the test data. Sample system descriptions can be found on pages 89-136 of the #SMM4H 2019 proceedings. Accepted system descriptions will be included in the #SMM4H 2022 proceedings.
We encourage at least one author of each accepted system description to register for the #SMM4H 2022 Workshop, co-located at COLING, and present their system as a poster. Selected participants, as determined by the program committee, will be invited to extend their system description to up to four pages, plus unlimited references, and present their system orally.
Contact & FAQ
Email Martin Krallinger at Krallinger.Martin@gmail.com, Luis Gasco at luis.gasco@bsc.es, and Darryl Estrada at darryl.estrada@bsc.es.
- Q: What is the goal of the shared task?
  The goal is to predict the named entities of the tweets in the test and background sets.
- Q: How do I register?
  Here: Google Form
- Q: How do I submit the results?
  In CodaLab.
- Q: Can I use additional training data to improve model performance?
  Yes, participants may use any additional training data they have available, as long as they describe it in the system description.
- Q: Is there a Google Group for the SocialDisNER task?
  Yes: Google Group