Description of the Corpus

Training and validation (annotated), test and background (unannotated) datasets
Guidelines

The SMM4H-Spanish corpus is a collection of 10,000 health-related tweets in Spanish, annotated with disease mentions by a medical expert. The annotation followed carefully designed guidelines that have proven useful for annotating both literature (clinical case reports) and electronic health records (EHRs). The aim of the corpus is to support the extraction of a diverse range of disease mentions from social media, enabling further characterization of health-related issues of practical importance.

The corpus data was obtained from a Twitter crawl focusing on selected accounts covering patient associations and organizations, healthcare institutions and professionals, as well as their followers, with the aim of enriching the collected social media content with healthcare-relevant tweets. The crawl was then filtered to keep only tweets written in Spanish, with particular (but not exclusive) emphasis on profiles located in Spain and some other Spanish-speaking countries.

The corpus was primarily annotated by medical experts in an iterative process that included the adaptation of medical document annotation guidelines specifically for this task. These guidelines will be publicly released together with the SocialDisNER corpus.

The annotation was performed using the web-based tool brat. Below is an example of what the annotated tweets look like:

Sample annotation of the SocialDisNER SMM4H-Spanish corpus.

In total, 10,000 tweets were annotated. They were split into 60% training (6,000), 20% development (2,000) and 20% test (2,000). The splits will be released according to the track schedule and made accessible on Zenodo.

FORMAT

SocialDisNER: Tweet disease mention detection. Annotations are stored in a tab-separated file with 5 columns:

tweet_id begin end type extraction
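
As an illustration of how such a file could be read, the following is a minimal Python sketch, not an official loader. It assumes that the first row is a header, that begin and end are integer offsets into the tweet text, and that the file is UTF-8 encoded; none of these points is confirmed by the description above. The Mention class, the read_mentions function and the sample row (including the tweet ID and the DISEASE label) are hypothetical and serve only as placeholders.

```python
import csv
import io
from typing import NamedTuple


class Mention(NamedTuple):
    """One annotation row (hypothetical container, not part of the corpus)."""
    tweet_id: str
    begin: int
    end: int
    type: str
    extraction: str


def read_mentions(handle, has_header=True):
    """Parse a tab-separated annotation file with 5 columns.

    Assumption: begin and end are integers; the first row is a header
    unless has_header is set to False.
    """
    # QUOTE_NONE keeps any quotation marks inside tweets as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    mentions = []
    for index, row in enumerate(reader):
        if index == 0 and has_header:
            continue  # skip the column names
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"row {index + 1}: expected 5 columns, got {len(row)}")
        tweet_id, begin, end, mention_type, extraction = row
        mentions.append(
            Mention(tweet_id, int(begin), int(end), mention_type, extraction)
        )
    return mentions


# Made-up sample in the 5-column layout; the values are placeholders only.
SAMPLE = (
    "tweet_id\tbegin\tend\ttype\textraction\n"
    "1234567890\t10\t18\tDISEASE\tdiabetes\n"
)

sample_mentions = read_mentions(io.StringIO(SAMPLE))
```

With a real file, the same function could be called as read_mentions(open(path, encoding="utf-8", newline="")), where path points to the downloaded annotation file. Tweet IDs are kept as strings so that long numeric identifiers are never altered by numeric conversion.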