MEDDOCAN: Medical Document Anonymization Track
The MEDDOCAN Track is sponsored by Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
IberLEF 2019 Workshop @ SEPLN 2019 (Bilbao), 24th Sep 2019
Generated resources
- Data: Gold Standard, Silver Standard & Annotation Guidelines
- Conference Proceedings
Please, cite Montserrat Marimon et al. “Automatic De-identification of Medical Texts in Spanish: the MEDDOCAN Track, Corpus, Guidelines, Methods and Evaluation of Results.” In: IberLEF@ SEPLN. 2019, pp. 618–638.
Schedule
Date | Event |
---|---|
March 18, 2019 | Sample set and Evaluation script released. |
March 20, 2019 | Training set released. |
April 4, 2019 | Development set released. |
April 29, 2019 | Test set released (includes background set). |
May 17, 2019 | End of evaluation period (system submissions). |
May 20, 2019 | Resuls posted and Test set with GS annotations released. |
May 31, 2019 | Working notes paper submission. |
June 14, 2019 | Notification of acceptance (peer-reviews). |
June 28, 2019 | Camera ready paper submission. |
September 24, 2019 | IberLEF 2019 Workshop. |
About the task
Clinical records with protected health information (PHI) cannot be directly shared “as is”, due to privacy constraints, making it particularly cumbersome to carry out NLP research in the medical domain. A necessary precondition for accessing clinical records outside of hospitals is their de-identification, i.e., the exhaustive removal, or replacement, of all mentioned PHI phrases.
The practical relevance of anonymization or de-identification of clinical texts motivated the proposal of two shared tasks, the 2006 and 2014 de-identification tracks, organized under the umbrella of the i2b2 (i2b2.org) community evaluation effort. The i2b2 effort has deeply influenced the clinical NLP community worldwide, but was focused on documents in English and covering characteristics of US-healthcare data providers.
As part of the IberLEF 2019 initiative we organize the first community challenge task specifically devoted to the anonymization of medical documents in Spanish, called the MEDDOCAN (Medical Document Anonymization) task.
In order to carry out these tasks we have prepared a synthetic corpus of 1000 clinical case studies. This corpus was selected manually by a practicing physician and augmented with PHI information from
discharge summaries and medical genetics clinical records.
The MEDDOCAN task will be structured into two sub-tasks:
- NER offset and entity type classification.
- Sensitive span detection.
MEDDOCAN Corpus
For this task, we have prepared a synthetic corpus of clinical cases enriched with PHI expressions, named the MEDDOCAN corpus. This MEDDOCAN corpus of 1,000 clinical case studies was selected manually by a practicing physician and augmented with PHI phrases by health documentalists, adding PHI information from discharge summaries and medical genetics clinical records. See an example of MEDDOCAN annotation visualized using the BRAT annotation interface in Figure 1.
For more detailed information see Description of the Corpus.