{"id":4003,"date":"2019-09-19T12:55:02","date_gmt":"2019-09-19T12:55:02","guid":{"rendered":"http:\/\/temu.bsc.es\/meddocan\/?p=4003"},"modified":"2019-10-03T16:34:35","modified_gmt":"2019-10-03T14:34:35","slug":"description-of-the-corpus","status":"publish","type":"post","link":"https:\/\/temu.bsc.es\/meddocan\/index.php\/description-of-the-corpus\/","title":{"rendered":"Description of the Corpus"},"content":{"rendered":"\n<p>For this task, we have prepared a synthetic corpus of clinical cases enriched with PHI expressions, named the MEDDOCAN corpus. Its 1,000 clinical case studies were selected manually by a practicing physician and augmented with PHI phrases by health documentalists, who added PHI information from discharge summaries and medical genetics clinical records.<\/p>\n\n\n\n<p>The final collection of 1,000 clinical cases that make up the corpus has around 33 thousand sentences, with an average of around 33 sentences per clinical case. The MEDDOCAN corpus contains around 495 thousand words, with an average of 494 words per clinical case, slightly fewer than the records of the i2b2 de-identification longitudinal corpus (617 tokens per record). 
The MEDDOCAN corpus will be distributed as plain text in UTF-8 encoding, with each clinical case stored as a single file. PHI annotations will be released in the popular BRAT format, which makes visualization of results straightforward, as shown in Figure 1.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" width=\"1024\" height=\"922\" src=\"http:\/\/temu.bsc.es\/meddocan\/wp-content\/uploads\/2019\/03\/image-1-1024x922.png\" alt=\"\" class=\"wp-image-3540\" srcset=\"https:\/\/temu.bsc.es\/meddocan\/wp-content\/uploads\/2019\/03\/image-1-1024x922.png 1024w, https:\/\/temu.bsc.es\/meddocan\/wp-content\/uploads\/2019\/03\/image-1-300x270.png 300w, https:\/\/temu.bsc.es\/meddocan\/wp-content\/uploads\/2019\/03\/image-1-768x692.png 768w, https:\/\/temu.bsc.es\/meddocan\/wp-content\/uploads\/2019\/03\/image-1.png 1045w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Figure 1: An example of MEDDOCAN annotation visualized using the BRAT annotation interface.<\/figcaption><\/figure>\n\n\n\n<p>The entire MEDDOCAN corpus has been randomly split into three subsets: the training, development and test sets. The <strong>training set<\/strong> comprises <strong>500 clinical cases<\/strong>, and the <strong>development and test sets 250 clinical cases each<\/strong>. Together with the test set, we will release an additional collection of 2,000 documents (the background set) to ensure that participating teams cannot make manual corrections and to encourage systems that can scale to larger data collections.<\/p>\n\n\n\n<p>For this task, we also prepared a script (see <a href=\"http:\/\/temu.bsc.es\/meddocan\/index.php\/resources\/\">Resources<\/a>) that converts between the BRAT annotation format and the annotation format used by the previous <em>i2b2<\/em> effort, making it easier to compare and adapt systems previously developed for English texts. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>For this task, we have prepared a synthetic corpus of clinical cases enriched with PHI expressions, named the MEDDOCAN corpus. This MEDDOCAN corpus of 1,000 clinical case studies was selected manually by a practicing physician and augmented with PHI phrases by health documentalists, adding PHI information from discharge summaries and medical genetics clinical records. The [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/posts\/4003"}],"collection":[{"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/comments?post=4003"}],"version-history":[{"count":2,"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/posts\/4003\/revisions"}],"predecessor-version":[{"id":4100,"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/posts\/4003\/revisions\/4100"}],"wp:attachment":[{"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/media?parent=4003"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/categories?post=4003"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/temu.bsc.es\/meddocan\/index.php\/wp-json\/wp\/v2\/tags?post=4003"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}