Annotation guidelines can be downloaded from Zenodo.
The Cantemist train, development, test, and background sets are already available at Zenodo.
The Cantemist corpus was manually annotated by clinical experts following the Cantemist guidelines. These guidelines contain rules for annotating neoplasm morphology mentions in Spanish oncology clinical cases, as well as for mapping these annotations to CIEO-3 (the Spanish version of ICD-O-3).
Guidelines were created de novo by clinical experts in three phases:
- First, a zero version of the guidelines was drafted after the clinical experts reviewed the neoplasm morphology annotations in the SPACCC corpus (Codiesp guidelines for tumor morphology).
- Second, a stable version of the guidelines was reached by iteratively annotating sample sets of the Cantemist corpus until quality control was satisfactory.
- Third, the guidelines are iteratively refined as manual annotation continues.
Post-annotation review steps:
- Consistency review: every annotated mention was searched for across all documents, and a clinical expert reviewed whether each occurrence found should be added to the annotations.
- CIEO-3 code length review: all codes were checked to have between 4 and 7 characters, inclusive (for example, the CIEO-3 code 8140/32 has 7 characters).
- Trailing newline: newline characters (\n) are removed from annotations.
- Internal newline check: annotations with newline characters within them are removed since they span more than one line.
- Starting and ending annotation characters: check that all annotations start and end with an alphanumeric character or a parenthesis. For example, ",adenocarcinoma," would be a wrong annotation since it is surrounded by commas.
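
The automatic checks in the list above (code length, trailing newline, internal newline, and boundary characters) could be sketched roughly as follows. This is only an illustrative sketch, not the script used to build the corpus: the `(text, code)` tuple representation, the function names, and the decision to flag failing annotations for manual review (rather than drop or fix them) are all assumptions made for the example.

```python
# Illustrative sketch only. The (text, code) tuple layout and all function
# names are hypothetical; the real corpus tooling is not described here.

def code_length_ok(code: str) -> bool:
    """Check that a CIEO-3 code has between 4 and 7 characters, inclusive."""
    return 4 <= len(code) <= 7


def strip_trailing_newlines(text: str) -> str:
    """Remove newline characters at the end of an annotation."""
    return text.rstrip("\n")


def has_internal_newline(text: str) -> bool:
    """True if a newline remains after trailing newlines are stripped."""
    return "\n" in strip_trailing_newlines(text)


def boundary_ok(text: str) -> bool:
    """Check that the first and last characters are alphanumeric or a parenthesis."""
    if not text:
        return False
    return all(ch.isalnum() or ch in "()" for ch in (text[0], text[-1]))


def review(annotations):
    """Split annotations into kept and flagged lists.

    Annotations spanning more than one line are removed outright.
    Flagging (instead of deleting) on a failed length or boundary
    check is an assumption of this sketch.
    """
    kept, flagged = [], []
    for text, code in annotations:
        text = strip_trailing_newlines(text)
        if has_internal_newline(text):
            continue  # spans more than one line: removed
        if code_length_ok(code) and boundary_ok(text):
            kept.append((text, code))
        else:
            flagged.append((text, code))
    return kept, flagged


sample = [
    ("adenocarcinoma\n", "8140/32"),   # trailing newline is stripped, then kept
    (",adenocarcinoma,", "8140/3"),    # surrounded by commas: flagged
    ("carcinoma\nductal", "8500/3"),   # internal newline: removed
    ("metástasis", "800"),             # code shorter than 4 characters: flagged
]
kept, flagged = review(sample)
```

With this sample, only the first annotation is kept, the comma-wrapped mention and the too-short code are flagged, and the multi-line annotation is discarded.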