Training and validation (annotated), test and background (unannotated) datasets


The SMM4H-Spanish corpus was manually annotated by expert linguists following the SMM4H-Spanish guidelines. These guidelines contain rules for annotating professions, employment statuses and work-related activities in health-related tweets in Spanish. They also include considerations on mapping the annotations to the ESCO and SNOMED-CT taxonomies.

Guidelines were created de novo in three phases:

  1. First, a version zero of the guidelines was developed after annotating an initial batch of ~200 tweets and outlining the main problems and difficulties in the data.
  2. Second, a stable version of the guidelines was reached by iteratively annotating sample sets of the ProfNER corpus until quality control was satisfactory.
  3. Third, the guidelines are iteratively refined as manual annotation continues.

The annotation guidelines are available in Spanish here and in English here.