Datasets

Training set

The training set contains 5,000 annotated tweets. It will be published on Zenodo.

Validation set

The validation set contains 2,500 annotated tweets. It will be published on Zenodo.

Test and background sets

The test set contains 2,500 tweets. The background set contains 50K tweets. Both will be published on Zenodo.

The test and background sets will be published together. You will have to submit predictions for the whole set, but you will only be evaluated on the test set predictions.
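As a rough illustration of what "evaluated only on the test set" means, the sketch below filters a full list of predictions down to those whose tweet belongs to the test set. Everything in it is hypothetical: the `tweet_id` and `mention` field names, the toy IDs, and the `keep_test_predictions` helper are assumptions for illustration only, not the official submission format or evaluation script.

```python
from typing import Dict, Iterable, List, Set


def keep_test_predictions(
    predictions: Iterable[Dict[str, str]], test_ids: Set[str]
) -> List[Dict[str, str]]:
    """Return only the predictions whose tweet ID belongs to the test set."""
    return [p for p in predictions if p["tweet_id"] in test_ids]


# Toy data: IDs and field names are invented for illustration.
predictions = [
    {"tweet_id": "t1", "mention": "diabetes"},   # test tweet
    {"tweet_id": "b1", "mention": "asthma"},     # background tweet
    {"tweet_id": "t2", "mention": "migraine"},   # test tweet
]
test_ids = {"t1", "t2"}

# Predictions on background tweets are dropped before scoring.
scored = keep_test_predictions(predictions, test_ids)
```

Participants cannot apply such a filter themselves, since the test and background tweets are distributed together; the point is only that predictions on background tweets do not affect the score.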

Test set with Gold Standard annotations

The Gold Standard annotations of the test set will be released after the submission deadline.

Corpus statistics

	Training	Development
# tweets	5000	2500
# characters	1253431	516768
# tokens	211555	84478
Avg. characters/tweet	250.69	206.71
Avg. tokens/tweet	42.31	33.79
# disease mentions	15173	4252
# unique disease mentions	4407	1413
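
The per-tweet averages in the table follow directly from the totals. A minimal sketch that recomputes them, using only the figures above (the dictionary layout and the `per_tweet` helper are illustrative choices, not part of the corpus release):

```python
# Totals copied from the statistics table above.
stats = {
    "Training": {"tweets": 5000, "characters": 1253431, "tokens": 211555},
    "Development": {"tweets": 2500, "characters": 516768, "tokens": 84478},
}


def per_tweet(total: int, tweets: int) -> float:
    """Average of a corpus-level total per tweet, rounded to two decimals."""
    return round(total / tweets, 2)


averages = {
    name: {
        "chars_per_tweet": per_tweet(s["characters"], s["tweets"]),
        "tokens_per_tweet": per_tweet(s["tokens"], s["tweets"]),
    }
    for name, s in stats.items()
}

for name, avg in averages.items():
    print(name, avg["chars_per_tweet"], avg["tokens_per_tweet"])
```

The recomputed values match the table: 250.69 and 42.31 for Training, 206.71 and 33.79 for Development.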