Dataset | Link |
---|---|
Train dataset for Subtask-1 and Subtask-2 | Train dataset |
Test dataset for Subtask-1 and Subtask-2 | To be published |
ToxHabits Dataset
The ToxHabits dataset is a collection of 1,500 clinical case reports in Spanish from different medical specialties, extracted from open-access journals. The corpus includes annotations for instances of substance use and abuse in the text, along with different pieces of information that characterize each instance. The annotation follows an event-based structure, heavily based on the annotation scheme of the SHAC corpus, which was used for the 2022 n2c2/UW shared task on extracting social determinants of health.
The ToxHabits Gold Standard and its annotation guidelines will be made available soon.
In ToxHabits, each annotated event is initiated by a Trigger (the white label in the examples), which is then characterized by one or more arguments, i.e. pieces of information that surround the trigger and provide context about the substance use. There are four different Trigger types: Tobacco, Cannabis, Alcohol and Drug (used for all other substances).
As for arguments, there are six different types:
- Type: The specific substance used (e.g. cocaine, heroin, poppers). Roughly, it answers the question “What kind of substance was used?”.
- Method: The way in which the substance is used (e.g. intravenously, inhaled). Roughly, it answers the question “How was the substance used?”.
- Amount: The quantity of substance used (e.g. 2 drinks, 10 cigarettes). Roughly, it answers the question “How much of the substance was used?”.
- Frequency: The periodicity with which the substance is used (e.g. every day, a year). Roughly, it answers the question “How often was the substance used?”.
- Duration: The amount of time during which the substance is used (e.g. for two years, since the beginning of the year). Roughly, it answers the question “For how long was the substance used?”.
- History: The moment at which the substance use stopped (e.g. in 2007). Roughly, it answers the question “Until when was the substance used?”.
Notably, (a) one sentence may contain more than one separate event; (b) the only compulsory argument type is StatusTime, while the others are annotated only when they appear in the text; (c) some events may have more than one argument of the same type; and (d) some annotations may share part of the same text span (i.e. there are nested annotations in the corpus).
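The event structure described above can be pictured as a small data model. The following is only a minimal sketch under stated assumptions: the class and function names (`Span`, `Event`, `make_span`), the sample sentence and the choice to place the StatusTime span on the trigger word are all hypothetical and are not taken from the ToxHabits guidelines.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Trigger types named in the description above.
TRIGGER_TYPES = {"Tobacco", "Cannabis", "Alcohol", "Drug"}
# The only compulsory argument type according to the description.
REQUIRED_ARGUMENT = "StatusTime"


@dataclass(frozen=True)
class Span:
    """A labelled stretch of text with character offsets (end is exclusive)."""
    label: str
    start: int
    end: int
    text: str

    def overlaps(self, other: Span) -> bool:
        # True when the two spans share at least one character (nested annotations).
        return self.start < other.end and other.start < self.end


@dataclass
class Event:
    """One trigger plus any number of arguments, possibly several of the same type."""
    trigger: Span
    arguments: list[Span] = field(default_factory=list)

    def by_label(self, label: str) -> list[Span]:
        return [a for a in self.arguments if a.label == label]

    def is_complete(self) -> bool:
        # Known trigger type and at least one compulsory StatusTime argument.
        return (self.trigger.label in TRIGGER_TYPES
                and bool(self.by_label(REQUIRED_ARGUMENT)))


def make_span(label: str, sentence: str, fragment: str) -> Span:
    """Hypothetical helper: locate the first occurrence of `fragment` in `sentence`."""
    start = sentence.index(fragment)
    return Span(label, start, start + len(fragment), fragment)


# Invented sentence, for illustration only (not taken from the corpus).
sentence = "Fumador de 20 cigarrillos al día"
event = Event(
    trigger=make_span("Tobacco", sentence, "Fumador"),
    arguments=[
        # Assumption: StatusTime shares the trigger's span, giving a nested annotation.
        make_span("StatusTime", sentence, "Fumador"),
        make_span("Amount", sentence, "20 cigarrillos"),
        make_span("Frequency", sentence, "al día"),
    ],
)
```

Keeping the arguments in a plain list, rather than one slot per type, is one way to accommodate points (c) and (d): repeated argument types and overlapping spans need no special handling.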
The dataset’s annotations will be provided as .ann files generated by brat, together with the associated plain-text files. In addition, JSON files will be provided to participants, showing how the arguments are nested under their triggers. Both subtasks can be performed using only the .ann files, but participants are encouraged to use the JSON files to explore more complex approaches. Below is an example of an event annotated in brat along with its JSON equivalent.
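Independently of the official files, the relation between brat standoff lines and a nested representation can be sketched in a few lines of Python. This is a rough illustration only: the sample annotation is invented, the function name `parse_ann` is hypothetical, and the dictionary layout it produces is an assumption that need not match the JSON files distributed with the task. The sketch reads only text-bound (`T`) and event (`E`) lines and ignores every other line type.

```python
import json
from collections import defaultdict

# Invented brat standoff content for the sentence "Fumador de 20 cigarrillos al día".
SAMPLE_ANN = (
    "T1\tTobacco 0 7\tFumador\n"
    "T2\tStatusTime 0 7\tFumador\n"
    "T3\tAmount 11 25\t20 cigarrillos\n"
    "T4\tFrequency 26 32\tal día\n"
    "E1\tTobacco:T1 StatusTime:T2 Amount:T3 Frequency:T4\n"
)


def parse_ann(ann_text):
    """Group text-bound annotations under the events that reference them."""
    spans = {}
    raw_events = []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        ident, _, rest = line.partition("\t")
        if ident.startswith("T"):
            # Format: "<label> <start> <end>[;<start> <end>...]\t<covered text>"
            header, _, covered = rest.partition("\t")
            label, _, offsets = header.partition(" ")
            fragments = [tuple(int(n) for n in frag.split())
                         for frag in offsets.split(";")]
            spans[ident] = {"id": ident, "label": label,
                            "offsets": fragments, "text": covered}
        elif ident.startswith("E"):
            # Format: "<TriggerType>:<Tid> <Role>:<id> ..."
            raw_events.append((ident, [p.split(":", 1) for p in rest.split()]))

    events = []
    for ident, roles in raw_events:
        (_, trigger_id), *args = roles
        arguments = defaultdict(list)
        for role, target in args:
            if target in spans:  # skip arguments that point at other events
                # brat numbers repeated roles (Amount, Amount2, ...); strip the suffix.
                arguments[role.rstrip("0123456789")].append(spans[target])
        events.append({"id": ident, "trigger": spans[trigger_id],
                       "arguments": dict(arguments)})
    return events


events = parse_ann(SAMPLE_ANN)
print(json.dumps(events, ensure_ascii=False, indent=2))
```

Stripping the numeric suffix from role names assumes that no argument label legitimately ends in a digit, which holds for the argument types listed above.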

