Clinical trials – MESINESP2: MEDICAL SEMANTIC INDEXING IN SPANISH

You can download the latest version of the dataset from this Zenodo page. Please check that you have downloaded the latest version of the datasets, since we will be including new resources during the shared task duration.

Training

The training dataset contains records from Registro Español de Estudios Clínicos (REEC). This organization does not give documents with the structure title/abstract, for that reason we have built artificial abstracts based on the content available given in the data crawled using their API. For each clinical trial, the API returns the set of fields shown in the image below.

These fields are joined in a string called “abstractText”, in this string we have included the content of each section preceded by its title in capital letters with two # symbols to easily identify it in case the participant wants to split the text. We have excluded those clinical trials that have some of their fields empty or are written in a language other than Spanish.

Since clinical trials are not indexed with DeCS terminology, we have used as training data a set of 3560 clinical trials that were automatically annotated in the latest edition of MESINESP and that were published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we have only selected predictions from runs with a MiF higher than 0.41, which corresponds with the submission of the best team.

After these clarifications, the following table shows some statistics on the 3560 clinical trial documents provided as a training set:

	REEC training set Abstracts’ length	DeCS Codes
Min.	307	11
Mean (Std)	7700.41 (3572.79)	14.68 (1.19)
Median	7283	15
Max.	21486	22

We provide the training corpus in a JSON string, which follows the same structure as that of subtrack 1:

{
  "articles": [
    {
      "id": "Id of the clinical trial",
      "title": "Title of the article",
      "abstractText": "Compose string of the sections of the clinical trial",
      "journal": "REEC",
      "year": "year of the clinical trial",
      "db": "REEC",
      "decsCodes": [
        "code1",
        "code2",
        "code3"
      ]
    }
  ]
}

Development

For the clinical trials subtrack we provide a Development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.

Some statistics about the test set are shown in the following table:

	REEC training set Abstracts’ length	DeCS Codes
Min.	147	2
Mean (Std)	6637.04 (4055.83)	13.86 (5.53)
Median	6314	14
Max.	17135	29

The Development set is served as JSON strings with the same format as the training set.

Test

The test dataset contains a collection of 8919 items. Out of this subset there are 461 clinical trials coming from REEC and 8458 clinical trials artificially constructed from drug datasheets that have a similar structure to REEC documents.

The evaluation of the systems will be performed on a set of 250 items annotated by DeCS experts following the same protocol as in subtrack 1. Similarly, these items will be published as Gold Standard after completion of the task.

The data of the test set will be served as JSON strings.

The format of the test set data in the JSON string will be the following:

{
  "articles": [
    {
      "id": "Id of the clinical trial",
      "title": "Title of the clinical trial",
      "abstractText": "Content of the clinical trial",
      "journal": "AEM" or "REEC",
      "year": year,
      "db": "AEM" or "REEC"
    }
  ]
}