- HITSZ-ICRC: https://github.com/xy-always/2020Iberlef
- lasigeBio: https://github.com/lasigeBioTM/CANTEMIST-Participation
- Hulat-UC3M: https://github.com/ssantamaria94/CANTEMIST-Participation
- ICB-UMA: https://github.com/guilopgar/CANTEMIST-2020
- Tong Wang: https://github.com/18720936539/CANTEMIST/
- Kathrync: https://github.com/kathrynchapman/CANTEMIST2020
- Recognai: https://github.com/recognai/cantemist-ner
Annotation Guidelines
Annotation guidelines can be downloaded from Zenodo.
Cantemist train development, test and background sets are already available at Zenodo
The Cantemist corpus was manually annotated by clinical experts following the Cantemist guidelines. These guidelines contain rules for annotating morphology neoplasms in Spanish oncology clinical cases; as well as for mapping these annotations to CIEO-3 (Spanish version of ICD-O-3).
Guidelines were created de novo by clinical experts in three phases:
- First, a zero version of guidelines after the clinical experts reviewed neoplasm morphology annotations in SPACCC corpus (Codiesp guidelines, for tumor morphology).
- Second, a stable version of guidelines was reached while annotating sample sets of Cantemist corpus iteratively until quality control was satisfactory.
- Third, guidelines are iteratively refined as manual annotation continues.
Post-annotation review steps:
- Consistency review: occurrences of all annotations were looked up in all documents and a clinical expert reviewed whether they should be added to the annotations.
- CIEO-3 Code length review: all codes were checked to have 4 or more characters and 7 or fewer characters (8140/32 CIEO-3 code has 7 characters).
- Trailing newline: newline characters (\n) are removed from annotations.
- Internal newline check: annotations with newline characters within them are removed since they span more than one line.
- Starting and ending annotation characters: check that all annotations start and end with an alphanumeric character or a parenthesis. For example (,adenocarcinoma, would be a wrong annotation since it is surrounded by commas).
Datasets
Annotation guidelines can be downloaded from Zenodo.
Cantemist train, development test and background sets are already available at Zenodo
Train Set
The annotated train set is composed of 501 clinical cases. Download it from Zenodo.
Development Set
There are two development sets, of 250 documents each. Download it from Zenodo.
Test and Background Set
The unannotated test and background sets are released on July, 3. They are released together. The goal of the shared task is to predict the annotations for these documents. Participants will submit predictions for both of them. However, they will only be evaluated on the predictions for the test set. Download them from Zenodo.
Test Set with Gold Standard annotations
The Gold Standard annotations of the test set (300 documents) are already released. Download them from Zenodo.
Description of the Corpus
Annotation guidelines can be downloaded from Zenodo.
Cantemist train, development, test and background sets are already available at Zenodo
Corpus format
- Subtasks CANTEMIST-NER: Brat annotation format.
Figure 1. Example Brat annotation for Cantemist-Ner.
- CANTEMIST-NORM: Brat annotation format.
Figure 2. Example Brat annotation for Cantemist-Norm.
- Subtask CANTEMIST-CODING: CodiEsp format. We provide a single plain text file per clinical case and a tab-separated file with all the unique codes per clinical case (see Figure 3).
Figure 3. Example tab-separated file for Cantemist-Coding.
General information
For this task professional clinical coding experts have annotated a corpus of clinical cases in Spanish with eCIE-O-3.1 codes using the BRAT annotation tool following well-defined annotation guidelines adapted form the clinical coding recommendations published by the Spanish Ministry of Health, after several cycles of quality control and annotation consistency analysis before annotating the entire dataset. Figure 2 shows a screenshot of a sample manual annotation generated using the BRAT annotation tool.
Figure 2. Example BRAT annotation with labeled tumor morphology entity mention.
The CANTEMIST corpus consists of a collection of 3000 clinical cases that will be distributed in plain text in UTF8 encoding, where each clinical case would be stored as a single file. These clinical case reports were carefully selected to represent records reflecting as much as possible clinical narrative related to electronic clinical reports. Figure 3 illustrates an example text snippet corresponding to a short sample record.
Figure 3. Example plain text CANTEMIST corpus document
Additionally, we will also provide the annotation files comprising the character offsets of the tumor morphology entity mentions in TSV (tab-separated values) BRAT format together with their corresponding eCIE-O-3.1 code annotations.
The final corpus will be randomly split into three subsets: training, development and test. In the case of training and development sets, additionally, to the clinical cases, a TSV file will be released. It will contain one row per annotation. Each row will consist of the eCIE-O-3.1 code of the clinical case, a label indicating the category of the annotation, the annotation code and a reference to the text span that stimulated the annotation (the evidence).
In addition to the test set, a larger background set of clinical case documents will be released to make sure that participating teams will not be able to do manual corrections. In addition, the background set will become a silver standard of texts coded through automatic eCIE-O-3.1 code predictions returned by participating teams.
The goal of the CANTEMIST task is to develop automatic eCIE-O-3.1 clinical coding systems for Spanish medical texts. These systems should rely on the use of the CANTEMIST corpus, a high-quality Gold Standard synthetic clinical corpus of 3000 records based on a manual annotation process done by human clinical coding experts together with an inter-annotator agreement consistency analysis.
The CANTEMIST task can be approached as a named entity recognition and normalization task, but also as a multi-class text classification task. Participants are encouraged to either propose solutions in one of these directions or to combine both approaches. As well, novel approaches are welcomed.
Resources
Evaluation Script
- Official evaluation script: Available on GitHub (beta version).
This is the official evaluation script of the task.
Word embeddings
- Spanish Medical Word Embeddings. Word embeddings generated from Spanish medical corpora. Download them from Zenodo.
It can be used as a building block for clinical NLP systems used in Spanish texts.
Baseline
Dictionary lookup based on Levenshtein distance. It looks for train and development annotations in the test set.
Results (precision, recall, f1):
- NER: 0.181, 0.737, 0.291
- NORM: 0.18, 0.73, 0.288
- CODING (MAP): 0.584
Linguistic Resources
- CUTEXT. See it on GitHub.
Medical term extraction tool.
It can be used to extract relevant medical terms from clinical cases. - SPACCC POS Tagger. See it on Zenodo.
Part Of Speech Tagger for Spanish medical domain corpus.
It can be used as a component of your system. - NegEx-MES. See it on Zenodo.
A system for negation detection in Spanish clinical texts based on NegEx algorithm.
It can be used as a component of your system. - Negation corpus. See it on GitHub
A Corpus of Negation and Uncertainty in Spanish Clinical Texts (and instructions to train the system). - AbreMES-X. See it on Zenodo.
Software used to generate the Spanish Medical Abbreviation DataBase. - AbreMES-DB. See it on Zenodo.
Spanish Medical Abbreviation DataBase.
It can be used to fine-tune your system. - MeSpEn Glossaries. See it on Zenodo.
Repository of bilingual medical glossaries made by professional translators.
It can be used to fine-tune your system.
Terminological Resources
- List of valid codes. Download it from here.
List of valid ICD-O-3 codes used in the task evaluation.