You can download the latest version of the dataset from this Zenodo page. Please make sure you have the latest version, since we will add new resources while the shared task is running.
Training
This task is novel: it requires indexing patents with DeCS, a controlled vocabulary different from the International Patent Classification taxonomy. As a result, there is not enough labeled content to train an automatic indexing system from scratch.
For this reason, we propose using transfer learning or adapting the models generated in the previous subtracks. For this purpose, we provide a dataset of patents labeled by experts.
Development
We provide a Development set manually indexed by expert annotators. This dataset includes 115 patents in Spanish extracted from Google Patents which have the IPC codes “A61P” and “A61K31”. We selected these patents based on semantic similarity to the MESINESP-L training set to facilitate model generation and to try to improve model performance.
The data is served as a JSON string in the following format:
{
"articles": [
{
"id": "Id of the patent",
"title": "Title of the patent",
"abstractText": "Content of the patent",
"journal": "patents",
"year": 2018,
"db": "Google Patents",
"decsCodes": [
"code1",
"code2",
"code3"
]
}
]
}
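The format above can be read with any standard JSON parser. The following is a minimal Python sketch, not an official loader: the function name `load_articles` and the inline sample record are hypothetical and only mirror the documented field names.

```python
import json


def load_articles(json_text):
    """Parse a JSON string in the documented format and return its list of records."""
    data = json.loads(json_text)
    return data["articles"]


# Hypothetical sample that mirrors the documented structure (placeholder values).
sample = """
{
  "articles": [
    {
      "id": "Id of the patent",
      "title": "Title of the patent",
      "abstractText": "Content of the patent",
      "journal": "patents",
      "year": 2018,
      "db": "Google Patents",
      "decsCodes": ["code1", "code2", "code3"]
    }
  ]
}
"""

articles = load_articles(sample)
for article in articles:
    # Each record carries the text fields plus its list of DeCS codes.
    print(article["id"], article["year"], article["decsCodes"])
```

For the real file, the same function could be applied to the file contents read as text, assuming the download is a single JSON document in this format.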
The table below shows descriptive statistics for the development dataset:
Patent development set | Abstracts’ length | DeCS Codes |
---|---|---|
Min. | 338 | 5 |
Mean (Std) | 2351.72 (1972.69) | 10.02 (3.11) |
Median | 1640.0 | 10 |
Max. | 12876 | 22 |
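Statistics of the kind shown in the table can be recomputed from the loaded records. The sketch below rests on two assumptions that the page does not state: abstract length is measured in characters, and the standard deviation is the sample standard deviation. The helper name `describe` and the toy records are hypothetical.

```python
import statistics


def describe(values):
    """Return min, mean, sample standard deviation, median and max of a list of numbers."""
    return {
        "min": min(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # assumption: sample (n - 1) standard deviation
        "median": statistics.median(values),
        "max": max(values),
    }


# Toy records with placeholder content; real records come from the development set.
toy_articles = [
    {"abstractText": "a" * 10, "decsCodes": ["code1", "code2"]},
    {"abstractText": "b" * 20, "decsCodes": ["code1"]},
    {"abstractText": "c" * 30, "decsCodes": ["code1", "code2", "code3"]},
]

# Assumption: length is counted in characters of abstractText.
length_stats = describe([len(a["abstractText"]) for a in toy_articles])
code_stats = describe([len(a["decsCodes"]) for a in toy_articles])
print(length_stats)
print(code_stats)
```

If the recomputed values differ from the table, the difference may come from either assumption, for example a token-based length or a population standard deviation.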
Test
We provide a test set containing 68404 records, corresponding to all patents published in Spanish with the IPC codes specified above. From this set, 150 records will be selected and labeled by DeCS experts under the protocol defined in subtask 1, and these will be used to evaluate the quality of the developed systems. As with the development set, we selected these 150 records based on semantic similarity to the MESINESP-L training set.
The test set data will be served as JSON strings with the following structure:
{
"articles": [
{
"id": "Id of the patent",
"title": "Title of the patent",
"abstractText": "Content of the patent",
"journal": "patents",
"year": year,
"db": "Google Patents"
}
]
}
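Test records have the same fields as development records, except that they carry no `decsCodes`. A quick sanity check before running a system over the test set might look like the sketch below. The function `check_test_record` and the example records are hypothetical and not part of any official tooling.

```python
# Field names taken from the documented test-set structure.
REQUIRED_FIELDS = {"id", "title", "abstractText", "journal", "year", "db"}


def check_test_record(record):
    """Return a list of problems found in one test-set record (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    if "decsCodes" in record:
        # Unlabeled test records are not expected to include DeCS codes.
        problems.append("unexpected decsCodes field")
    return problems


# Hypothetical records with placeholder values.
good_record = {
    "id": "Id of the patent",
    "title": "Title of the patent",
    "abstractText": "Content of the patent",
    "journal": "patents",
    "year": 2018,
    "db": "Google Patents",
}
bad_record = {"id": "Id of the patent", "decsCodes": ["code1"]}

print(check_test_record(good_record))
print(check_test_record(bad_record))
```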
The annotated dataset will be published as the Gold Standard upon completion of the task.