{"id":16,"date":"2024-03-25T09:36:18","date_gmt":"2024-03-25T09:36:18","guid":{"rendered":"https:\/\/temu.bsc.es\/multicardioner\/?page_id=16"},"modified":"2024-05-02T14:27:38","modified_gmt":"2024-05-02T14:27:38","slug":"corpus-description","status":"publish","type":"page","link":"https:\/\/temu.bsc.es\/multicardioner\/corpus-description\/","title":{"rendered":"Corpus Description"},"content":{"rendered":"\n<h1 class=\"wp-block-heading\">Corpus Description<\/h1>\n\n\n\n<h2 class=\"wp-block-heading\">About the texts<\/h2>\n\n\n\n<p>MultiCardioNER, being focused on the adaptation of general clinical models to specific specialties and the creation of multilingual models, uses multiple datasets:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <a href=\"https:\/\/temu.bsc.es\/distemist\"><strong>DisTEMIST<\/strong><\/a> and <strong>DrugTEMIST<\/strong> (newly-released for this task) corpora are a collection of <strong>1,000 clinical cases in Spanish<\/strong> from different medical specialties (incl. oncology, otorhinolaryngology, dentistry, pediatrics, primary care, allergology, radiology, psychiatry, ophthalmology and more). They are annotated with <strong>disease<\/strong> and <strong>medication<\/strong> mentions, respectively. Both of them use the same text documents, which belong to the <a href=\"https:\/\/zenodo.org\/record\/2560316\" target=\"_blank\" rel=\"noreferrer noopener\">SPACCC<\/a> corpus and are also the same ones used in <a href=\"https:\/\/temu.bsc.es\/medprocner\">MedProcNER\/ProcTEMIST<\/a> and <a href=\"https:\/\/temu.bsc.es\/symptemist\">SympTEMIST<\/a>, making all four datasets complementary for medical entity recognition. The DrugTEMIST corpus is also released in English and Italian.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A collection of cardiology clinical case reports (<strong>CardioCCC<\/strong>) is used for the domain adaptation part of the task. 
Clinical case reports are a medical text genre that describes a patient&#8217;s medical history, symptoms, diagnosis, and treatment in detail. The dataset contains <strong>508 documents<\/strong>, split into 258 for development and 250 for testing. It has been annotated with diseases and medications using the same guidelines as the DisTEMIST and DrugTEMIST corpora. The medications part is released in three languages: Spanish, English and Italian. Although the 258-document split was originally devised as a development set, participants are allowed to mix and match their data as they see fit to create different experiments.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">About the annotations<\/h2>\n\n\n\n<p>All datasets were manually annotated by clinical experts using the <a href=\"https:\/\/brat.nlplab.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">brat annotation tool<\/a>, following annotation guidelines that were refined over several cycles of quality control and annotation consistency analysis before the entire dataset was annotated.<\/p>\n\n\n\n<p>The annotations were originally created in Spanish and then transferred into English and Italian via machine translation and lexical annotation projection. The result of this process was reviewed and validated by clinicians who are native speakers of each language. To account for possible mistranslations that affected the clinical entities, these experts provided alternative suggestions for annotated entities whenever they disagreed with the automatic translation. 
These suggestions were then integrated into the text by replacing the existing annotated span with the experts&#8217; proposed text.<\/p>\n\n\n\n<p>More information on the annotation scheme (and its inter-annotator agreement) is available on the <a href=\"https:\/\/temu.bsc.es\/multicardioner\/annotation-guidelines\/\" target=\"_blank\" rel=\"noreferrer noopener\">Annotation Guidelines<\/a> page.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">About the format<\/h2>\n\n\n\n<p>MultiCardioNER is made available in two formats: .ann (used by brat) and .tsv. For more information on brat\u2019s format, please visit: <a href=\"https:\/\/brat.nlplab.org\/standoff.html\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/brat.nlplab.org\/standoff.html<\/a>. The .tsv file columns are explained in the dataset\u2019s attached README file.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Corpus Description About the texts MultiCardioNER, being focused on the adaptation of general clinical models to specific specialties and the creation of multilingual models, uses multiple datasets: About the annotations All datasets were manually annotated by clinical experts using the brat annotation tool following well-defined annotation guidelines, defined after several cycles of quality control and 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-16","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/pages\/16","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/comments?post=16"}],"version-history":[{"count":5,"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/pages\/16\/revisions"}],"predecessor-version":[{"id":139,"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/pages\/16\/revisions\/139"}],"wp:attachment":[{"href":"https:\/\/temu.bsc.es\/multicardioner\/wp-json\/wp\/v2\/media?parent=16"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}