{"id":212,"date":"2026-02-26T16:13:10","date_gmt":"2026-02-26T15:13:10","guid":{"rendered":"https:\/\/temu.bsc.es\/MultiClinAI\/?page_id=212"},"modified":"2026-03-24T10:26:56","modified_gmt":"2026-03-24T09:26:56","slug":"multiclinner-data","status":"publish","type":"page","link":"https:\/\/temu.bsc.es\/MultiClinAI\/data\/multiclinner-data\/","title":{"rendered":"MultiClinNER Data"},"content":{"rendered":"\n<p>The corpora for all target languages are available online at Zenodo: <a href=\"https:\/\/doi.org\/10.5281\/zenodo.18508037\">MultiClinNER<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>MultiClinNER Data<\/strong><\/h2>\n\n\n\n<p>The <strong>MultiClinNER<\/strong> subtask focuses on multilingual comparable clinical entity recognition. The training data consists of manually validated annotations of three clinical entity types across seven languages: <strong>Spanish<\/strong> (es), <strong>English<\/strong> (en), <strong>Dutch<\/strong> (nl), <strong>Italian<\/strong> (it), <strong>Romanian<\/strong> (ro), <strong>Swedish<\/strong> (sv), and <strong>Czech<\/strong> (cz).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Sources<\/h2>\n\n\n\n<p>The dataset is built from several well-established clinical corpora used in previous clinical shared tasks:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>SpaCCC (Spanish Clinical Case Corpus)<\/strong> consists of 1,000 clinical case reports covering multiple medical specialties.<\/li>\n\n\n\n<li><strong>CardioCCC (Cardiology Clinical Case Corpus)<\/strong> comprises 508 clinical case reports from the field of cardiology.<\/li>\n\n\n\n<li><strong>OnaCCC (Original Native Clinical Case Corpus)<\/strong> is a collection of smaller sub-corpora of clinical case reports from open-access journals in each of the target languages (Czech, English, Dutch, Italian, Romanian and Swedish). The number of documents varies by language due to differences in the availability of open-access clinical case reports. 
Overall, the size of the language-specific sub-corpora ranges from 100 to 1,132 clinical case reports, with 100 documents in Czech, 1,131 in Dutch, 1,132 in English, 317 in Italian, 113 in Romanian, and 100 in Swedish.<\/li>\n<\/ul>\n\n\n\n<p>These corpora were originally annotated following the same annotation guidelines, which cover four clinical entity types:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>DISEASE<\/strong><\/li>\n\n\n\n<li><strong>SYMPTOM<\/strong><\/li>\n\n\n\n<li><strong>PROCEDURE<\/strong><\/li>\n\n\n\n<li>MEDICATION<\/li>\n<\/ul>\n\n\n\n<p>For the purposes of the MultiClinAI shared task, only the first three entity types (DISEASE, SYMPTOM, and PROCEDURE) are evaluated.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Multilingual Data Generation<\/h2>\n\n\n\n<p>The comparable multilingual versions of the corpora were created through a combination of <strong>machine translation<\/strong>, <strong>annotation projection<\/strong>, and <strong>expert validation<\/strong>.<\/p>\n\n\n\n<p>First, the original manually annotated Spanish texts were translated into the six target languages using machine translation. Next, the annotated entity mentions from the Spanish Gold Standard were translated independently. A lexical lookup system was then built and used to locate the translated entity mentions within the translated texts.<\/p>\n\n\n\n<p>Finally, bilingual clinical experts validated the projected annotations. Using the side-by-side comparison view in the BRAT annotation tool, they corrected the projected annotations so that they matched the original Gold Standard annotations as closely as possible.<\/p>\n\n\n\n<p>To complement the translated data, the OnaCCC corpus provides additional clinical case reports that were originally written in each target language. 
These texts were translated into Spanish and annotated following the same methodology to ensure consistency across the dataset.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Dataset Characteristics<\/h2>\n\n\n\n<p>The MultiClinNER training data therefore includes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Clinical case reports from multiple medical domains<\/li>\n\n\n\n<li>Texts available in <strong>seven languages<\/strong><\/li>\n\n\n\n<li>Expert-validated annotations<\/li>\n\n\n\n<li>Three clinical entity types: <strong>DISEASE<\/strong>, <strong>SYMPTOM<\/strong>, and <strong>PROCEDURE<\/strong><\/li>\n<\/ul>\n\n\n\n<p>The dataset contains:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>original Spanish clinical case reports<\/strong> used as source texts<\/li>\n\n\n\n<li><strong>Translated texts<\/strong> derived from Spanish source documents<\/li>\n\n\n\n<li><strong>Native texts<\/strong> written directly in the target language<\/li>\n<\/ul>\n\n\n\n<p>This design allows the evaluation of multilingual, cross-lingual, and language-specific NER systems.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Format<\/h2>\n\n\n\n<p>The dataset follows the <strong>BRAT standoff annotation format<\/strong>, where:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>.txt<\/code> files contain the clinical case reports<\/li>\n\n\n\n<li><code>.ann<\/code> files contain the entity annotations<\/li>\n<\/ul>\n\n\n\n<p>Each annotation line gives an identifier, the <strong>entity type<\/strong>, the exact start and end <strong>character offsets<\/strong> of the mention within the text, and the mention string itself.<\/p>\n\n\n\n<p>Example annotation format:<\/p>\n\n\n\n<p><code>T1 DISEASE 125 146 myocardial infarction<\/code><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Folder Structure<\/h2>\n\n\n\n<p>The training data is organized by <strong>language<\/strong> and <strong>entity type<\/strong> to facilitate system development.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>MultiClinNER\/\n 
\u251c\u2500\u2500 MultiClinNER-es\/\n \u2502    \u251c\u2500\u2500 MultiClinNER-es-train\/\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-es-train-disease\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ann\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-es-train-disease-0001.ann\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-es-train-disease-0002.ann\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 txt\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-es-train-disease-0001.txt\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-es-train-disease-0002.txt\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-es-train-symptom\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-es-train-procedure\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u251c\u2500\u2500 MultiClinNER-cz\/\n \u2502    \u251c\u2500\u2500 MultiClinNER-cz-train\/\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-cz-train-disease\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ann\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 txt\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-cz-train-symptom\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinNER-cz-train-procedure\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u251c\u2500\u2500 MultiClinNER-{nl,en,it,ro,sv}\/ (same as es and cz)<\/code><\/pre>\n\n\n\n<p>Each entity type is separated to simplify experiments focusing on a single concept category.<\/p>\n\n\n\n<p>Participants may use the data to train monolingual or 
multilingual systems and may submit results for any subset of languages.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The corpora for all target languages are available online at Zenodo: MultiClinNER MultiClinNER Data The MultiClinNER subtask focuses on multilingual comparable clinical entity recognition. The training data consists of manually validated annotations of three clinical entity types across seven languages: Spanish (es), English (en), Dutch (nl), Italian (it), Romanian (ro), Swedish (sv), and Czech (cz). [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":16,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-212","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/212","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/comments?post=212"}],"version-history":[{"count":12,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/212\/revisions"}],"predecessor-version":[{"id":264,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/212\/revisions\/264"}],"up":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/16"}],"wp:attachment":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/media?parent=212"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}