Evaluation
The evaluation will be conducted by comparing the automatically generated outputs against the expert-annotated gold standard, which serves as the reference for measuring system performance.
Evaluation metrics
The primary evaluation metrics for ToxNER (Subtask 1) and ToxUse (Subtask 2) will be micro-averaged precision, recall, and F1-score.
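In the standard micro-averaged formulation, true positives (TP), false positives (FP), and false negatives (FN) are pooled over all predictions in the test set before the scores are computed:

\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
\]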

Evaluation library
The evaluation scripts, together with a README file containing detailed instructions, are available on GitHub so that participating teams can systematically fine-tune and improve their systems on the provided training data.
You can run the script using:
$> python main.py -g ..\ground_truth\gs.tsv -p ..\dummy_submission\test_submission.tsv -s ner
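For illustration, the short sketch below shows how micro-averaged precision, recall, and F1 can be computed for the NER subtask. It assumes a hypothetical TSV format with one annotation per line (document id, start offset, end offset, label); the official main.py may use a different input format and additional matching rules, so this is only a sketch of the underlying computation, not the official scorer.

# Minimal sketch of micro-averaged NER scoring over exact span matches.
# Assumed (hypothetical) TSV format: document_id <TAB> start <TAB> end <TAB> label.
import csv
import sys


def load_annotations(path):
    """Load annotations as a set of (doc_id, start, end, label) tuples."""
    annotations = set()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            doc_id, start, end, label = row[:4]
            annotations.add((doc_id, int(start), int(end), label))
    return annotations


def micro_scores(gold, predicted):
    """Compute micro-averaged precision, recall, and F1 over exact matches."""
    tp = len(gold & predicted)   # predictions that exactly match a gold annotation
    fp = len(predicted - gold)   # spurious predictions
    fn = len(gold - predicted)   # missed gold annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold_file, pred_file = sys.argv[1], sys.argv[2]
    p, r, f = micro_scores(load_annotations(gold_file), load_annotations(pred_file))
    print(f"P={p:.4f}  R={r:.4f}  F1={f:.4f}")

Saved to a file, the sketch can be run with the gold-standard and prediction paths as its two arguments and prints the three scores.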