{"id":218,"date":"2026-02-26T16:31:35","date_gmt":"2026-02-26T15:31:35","guid":{"rendered":"https:\/\/temu.bsc.es\/MultiClinAI\/?page_id=218"},"modified":"2026-03-04T10:57:52","modified_gmt":"2026-03-04T09:57:52","slug":"multiclincorpus-data","status":"publish","type":"page","link":"https:\/\/temu.bsc.es\/MultiClinAI\/data\/multiclincorpus-data\/","title":{"rendered":"MultiClinCorpus Data"},"content":{"rendered":"\n<p>The corpora for all target languages are available online on Zenodo: <a href=\"https:\/\/zenodo.org\/records\/18772832\">MultiClinCorpus<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>MultiClinCorpus Data<\/strong><\/h2>\n\n\n\n<p>The <strong>MultiClinCorpus<\/strong> subtask focuses on the automatic creation of comparable multilingual clinical corpora through cross-lingual, weakly supervised methods for training data generation. Unlike MultiClinNER, where systems extract entities directly from text, this task requires systems to generate annotated corpora in multiple target languages starting from a Spanish seed corpus. 
Participants are encouraged to explore a range of automatic approaches for transferring or inducing named entity labels across languages\u2014particularly methods designed to support low-resource settings and to reduce the high cost and scarcity of manual annotation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Task Setting<\/h2>\n\n\n\n<p>Participants are provided with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A <strong>Spanish Gold Standard corpus<\/strong> with manually annotated entities.<br><\/li>\n\n\n\n<li>The <strong>translated versions of the same texts<\/strong> in six target languages.<br><\/li>\n\n\n\n<li>Training examples where entity correspondences across languages have been manually validated.<br><\/li>\n<\/ul>\n\n\n\n<p>The goal is to automatically identify the exact character offsets of the corresponding entity mentions in the translated texts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Sources<\/h2>\n\n\n\n<p>The underlying textual resources are the same as those used in <a href=\"https:\/\/temu.bsc.es\/MultiClinAI\/data\/multiclinner-data\/\">MultiClinNER<\/a>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>SpaCCC<\/strong><strong><br><\/strong><\/li>\n\n\n\n<li><strong>CardioCCC<\/strong><strong><br><\/strong><\/li>\n\n\n\n<li><strong>OnaCCC<\/strong><\/li>\n<\/ul>\n\n\n\n<p>These corpora include clinical case reports covering multiple medical specialties and were annotated using the same guidelines to ensure consistency across datasets.<\/p>\n\n\n\n<p>The entity types included in the task are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>DISEASE<\/li>\n\n\n\n<li>SYMPTOM<\/li>\n\n\n\n<li>PROCEDURE<\/li>\n<\/ul>\n\n\n\n<p>The seed language for annotation projection is <strong>Spanish<\/strong>, and projections are required for:<\/p>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Czech<br><\/li>\n\n\n\n<li>English<br><\/li>\n\n\n\n<li>Dutch<br><\/li>\n\n\n\n<li>Italian<br><\/li>\n\n\n\n<li>Romanian<br><\/li>\n\n\n\n<li>Swedish<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Dataset Characteristics<\/h2>\n\n\n\n<p>The resulting dataset provides:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Parallel clinical case reports across seven languages<br><\/li>\n\n\n\n<li>Expert-validated cross-lingual entity correspondences<br><\/li>\n\n\n\n<li>Three clinically relevant entity categories<br><\/li>\n\n\n\n<li>Texts from both translated and native clinical documents<br><\/li>\n<\/ul>\n\n\n\n<p>This dataset enables the development and evaluation of methods such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Annotation projection<br><\/li>\n\n\n\n<li>Word alignment<br><\/li>\n\n\n\n<li>Multilingual representation learning<br><\/li>\n\n\n\n<li>Generative approaches<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Data Format<\/h2>\n\n\n\n<p>The data is distributed in <strong>BRAT standoff format<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>.txt<\/code> files contain the document text.<br><\/li>\n\n\n\n<li><code>.ann<\/code> files contain the validated entity spans.<\/li>\n<\/ul>\n\n\n\n<p>Each entity annotation corresponds to a projected mention aligned with the Spanish Gold Standard.<\/p>\n\n\n\n<p>Example annotation format:<\/p>\n\n\n\n<p>(es)<\/p>\n\n\n\n<p><code>T1 DISEASE 253 285 insuficiencia respiratoria grave<\/code><\/p>\n\n\n\n<p>(en)<\/p>\n\n\n\n<p><code>T1 DISEASE 238 264 severe respiratory failure<\/code><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Folder Structure<\/h2>\n\n\n\n<p>The training data is organized by <strong>language <\/strong>and<strong> entity type<\/strong> to facilitate system development.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>MultiClinCorpus\/\n \u251c\u2500\u2500 MultiClinCorpus-es\/\n \u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train\/\n \u2502    
\u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train-disease\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ann\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train-disease-0001.ann\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train-disease-0002.ann\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 txt\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train-disease-0001.txt\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train-disease-0002.txt\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train-symptom\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-es-train-procedure\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u251c\u2500\u2500 MultiClinCorpus-cz\/\n \u2502    \u251c\u2500\u2500 MultiClinCorpus-cz-train\/\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-cz-train-disease\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ann\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 txt\/\n \u2502    \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-cz-train-symptom\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u2502    \u2502    \u251c\u2500\u2500 MultiClinCorpus-cz-train-procedure\/\n \u2502    \u2502    \u2502    \u251c\u2500\u2500 ...\n \u251c\u2500\u2500 MultiClinCorpus-{nl,en,it,ro,sv}\/ (same as es and cz)<\/code><\/pre>\n\n\n\n<p>This structure enables participants to easily identify the <strong>seed annotations in Spanish <\/strong>and the <strong>corresponding annotations in the target languages<\/strong>.&nbsp;<\/p>\n\n\n\n<p>Participants may submit 
results for any target language, and systems may use any modeling approach capable of identifying the projected entity spans.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The corpora for all target languages are available online on Zenodo: MultiClinCorpus. MultiClinCorpus Data The MultiClinCorpus subtask focuses on the automatic creation of comparable multilingual clinical corpora through cross-lingual, weakly supervised methods for training data generation. Unlike MultiClinNER, where systems extract entities directly from text, this task requires systems to generate annotated corpora in multiple [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":16,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-218","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/218","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/comments?post=218"}],"version-history":[{"count":8,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/218\/revisions"}],"predecessor-version":[{"id":252,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/218\/revisions\/252"}],"up":[{"embeddable":true,"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/pages\/16"}],"wp:attachment":[{"href":"https:\/\/temu.bsc.es\/MultiClinAI\/wp-json\/wp\/v2\/media?parent=218"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}