{"id":4003,"date":"2019-09-19T12:55:02","date_gmt":"2019-09-19T12:55:02","guid":{"rendered":"http:\/\/temu.bsc.es\/meddocan\/?p=4003"},"modified":"2021-02-28T20:08:21","modified_gmt":"2021-02-28T20:08:21","slug":"description-of-the-corpus","status":"publish","type":"post","link":"https:\/\/temu.bsc.es\/smm4h-spanish\/?p=4003","title":{"rendered":"Description of the Corpus"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p><a rel=\"noreferrer noopener\" href=\"https:\/\/doi.org\/10.5281\/zenodo.4309356\" target=\"_blank\">Training and validation (annotated), test and background (unannotated) datasets<\/a><\/p><p><a rel=\"noreferrer noopener\" href=\"https:\/\/doi.org\/10.5281\/zenodo.4306016\" target=\"_blank\">Guidelines<\/a><\/p><\/blockquote>\n\n\n\n<p>The SMM4H-Spanish corpus is a collection of <strong>10,000 health-related tweets<\/strong> in Spanish annotated with professions, employment statuses and other work-related activities. The aim of the corpus is to extract professions from social media to enable characterizing health-related issues, in particular in the context of COVID-19 epidemiology as well as mental health conditions.<\/p>\n\n\n\n<p>The corpus data was obtained from a Twitter crawl that used keywords like &#8220;Covid-19&#8221;, &#8220;epidemia&#8221; (<em>epidemic<\/em>) or &#8220;confinamiento&#8221; (<em>lockdown<\/em>), as well as hashtags such as &#8220;#yomequedoencasa&#8221; (<em>#istayathome<\/em>), to retrieve <strong>relevant tweets<\/strong>. The crawl was then filtered to keep only tweets written in Spanish and posted from Spain.<\/p>\n\n\n\n<p>The corpus was annotated by <strong>expert linguists<\/strong> in an iterative process that included the creation of annotation guidelines specifically for this task. 
These guidelines are described and available for download <a href=\"https:\/\/temu.bsc.es\/smm4h-spanish\/?p=3932\">here<\/a>.<\/p>\n\n\n\n<p>We have performed a <strong>consistency analysis<\/strong> of the corpus. 10% of the documents were annotated both by an internal annotator and by the expert linguists. The preliminary Inter-Annotator Agreement (pairwise agreement) is 0.919.<\/p>\n\n\n\n<p>The annotation process was performed using the web-based tool <a href=\"https:\/\/brat.nlplab.org\/\">brat<\/a>. Below is an example of what the annotated tweets look like:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"512\" height=\"102\" src=\"https:\/\/temu.bsc.es\/smm4h-spanish\/wp-content\/uploads\/2020\/12\/unnamed.png\" alt=\"\" class=\"wp-image-4348\" srcset=\"https:\/\/temu.bsc.es\/smm4h-spanish\/wp-content\/uploads\/2020\/12\/unnamed.png 512w, https:\/\/temu.bsc.es\/smm4h-spanish\/wp-content\/uploads\/2020\/12\/unnamed-300x60.png 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><figcaption>Sample annotation of the SMM4H-Spanish corpus.<\/figcaption><\/figure><\/div>\n\n\n\n<p>In total, 10,000 tweets were annotated. They were split into 60% training (6,000), 20% development (2,000) and 20% test (2,000). The different splits can be downloaded <a href=\"https:\/\/temu.bsc.es\/smm4h-spanish\/?p=3999\">here<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">FORMAT<\/h3>\n\n\n\n<p><strong>Track A \u2013 Tweet binary classification<\/strong>. 
Annotations are stored in a tab-separated file with 2 columns:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">tweet_id label<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"221\" height=\"207\" src=\"http:\/\/temu.bsc.es\/smm4h-spanish\/wp-content\/uploads\/2020\/12\/classification.png\" alt=\"\" class=\"wp-image-4419\"\/><\/figure><\/div>\n\n\n\n<p><strong>Track B \u2013 NER offset detection and classification<\/strong>. Annotations are stored in a tab-separated file with 5 columns:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">tweet_id begin end type extraction<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"734\" height=\"86\" src=\"http:\/\/temu.bsc.es\/smm4h-spanish\/wp-content\/uploads\/2020\/12\/Screenshot-from-2020-12-15-12-03-09.png\" alt=\"\" class=\"wp-image-4395\" srcset=\"https:\/\/temu.bsc.es\/smm4h-spanish\/wp-content\/uploads\/2020\/12\/Screenshot-from-2020-12-15-12-03-09.png 734w, https:\/\/temu.bsc.es\/smm4h-spanish\/wp-content\/uploads\/2020\/12\/Screenshot-from-2020-12-15-12-03-09-300x35.png 300w\" sizes=\"auto, (max-width: 734px) 100vw, 734px\" \/><\/figure><\/div>\n\n\n\n<p>In addition, the Track B corpus is provided in brat format and in the IOB tagging scheme. See it on <a rel=\"noreferrer noopener\" href=\"https:\/\/doi.org\/10.5281\/zenodo.4309356\" target=\"_blank\">Zenodo<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Training and validation (annotated), test and background (unannotated) datasets Guidelines The SMM4H-Spanish corpus is a collection of 10,000 health-related tweets in Spanish annotated with professions, employment statuses and other work-related activities. 
The aim of the corpus is to extract professions from social media to enable characterizing health-related issues, in particular in the context of COVID-19 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-4003","post","type-post","status-publish","format-standard","hentry","category-data"],"_links":{"self":[{"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=\/wp\/v2\/posts\/4003","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4003"}],"version-history":[{"count":16,"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=\/wp\/v2\/posts\/4003\/revisions"}],"predecessor-version":[{"id":4541,"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=\/wp\/v2\/posts\/4003\/revisions\/4541"}],"wp:attachment":[{"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4003"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4003"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/temu.bsc.es\/smm4h-spanish\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4003"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}