Training and validation (annotated), test and background (unannotated) datasets
The SMM4H-Spanish corpus is a collection of 10,000 health-related tweets in Spanish annotated with professions, employment statuses and other work-related activities. The corpus is intended to support the extraction of professions from social media, which in turn helps characterize health-related issues, particularly in the context of COVID-19 epidemiology and mental health conditions.
The corpus data was obtained from a Twitter crawl that used keywords like “Covid-19”, “epidemia” (epidemic) or “confinamiento” (lockdown), as well as hashtags such as “#yomequedoencasa” (#istayathome), to retrieve relevant tweets. This crawl was further filtered to keep only tweets that were written in Spanish and posted from Spain.
The corpus was annotated by expert linguists in an iterative process that included the creation of annotation guidelines specifically for this task. These guidelines are described and available for download here.
We performed a consistency analysis of the corpus: 10% of the documents were annotated by an internal annotator as well as by the expert linguists. The preliminary inter-annotator agreement (pairwise agreement) is 0.919.
The annotation process was performed using the web-based tool brat. Below is an example of what the annotated tweets look like:
All in all, 10,000 tweets were annotated. They were split into 60% training (6,000), 20% development (2,000) and 20% test (2,000). The different splits can be downloaded here.
Track A – Tweet binary classification. Annotations are stored in a tab-separated file with 2 columns:
Track B – NER offset detection and classification. Annotations are stored in a tab-separated file with 5 columns:
tweet_id begin end type extraction
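As a rough illustration of how the five-column Track B file could be read, the sketch below parses it with Python's standard csv module. It assumes the file starts with a header row naming the columns listed above and that begin and end are character offsets into the tweet text; neither is confirmed here. The Mention class, the read_mentions helper, and the sample row (including its label and offsets) are invented for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Mention:
    """One annotated mention (hypothetical container, not part of the corpus tooling)."""
    tweet_id: str
    begin: int
    end: int
    type: str
    extraction: str


def read_mentions(handle):
    """Read tab-separated annotations, assuming a header row with the five column names."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        Mention(
            tweet_id=row["tweet_id"],
            begin=int(row["begin"]),
            end=int(row["end"]),
            type=row["type"],
            extraction=row["extraction"],
        )
        for row in reader
    ]


# Invented sample data standing in for a real annotation file.
sample = "tweet_id\tbegin\tend\ttype\textraction\n123\t4\t10\tPROFESION\tmédico\n"
mentions = read_mentions(io.StringIO(sample))
print(mentions[0])
```

With a real file, the same function can be called on a handle opened with open(path, encoding="utf-8", newline=""), where path is wherever the downloaded annotations are stored.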
In addition, the corpus of Track B is provided in Brat format and in the IOB tagging scheme. See it on Zenodo.
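To show how span annotations relate to IOB tags, here is a minimal sketch that converts character-offset spans into token-level tags. It is not the conversion used to produce the released files: it assumes whitespace tokenization, end-exclusive offsets, and non-overlapping spans, and the spans_to_iob helper, the example sentence, and its label are all made up for illustration.

```python
import re


def spans_to_iob(text, spans):
    """Tag whitespace tokens with IOB labels.

    spans: list of (begin, end, label) character offsets, end exclusive (an assumption).
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for begin, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if match.start() < end and match.end() > begin:
                # The token covering the span start gets B-, later tokens get I-.
                prefix = "B" if match.start() <= begin else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Invented example: one span covering "enfermera de urgencias".
text = "Mi hermana es enfermera de urgencias"
tagged = spans_to_iob(text, [(14, 36, "PROFESION")])
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

A real pipeline would likely use a tokenizer suited to tweets (hashtags, mentions, emoji), so token boundaries and the resulting tags may differ from the distributed IOB files.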