Dataset
This multidisciplinary dataset, developed by the SoFAIR project with CLARIN-PL, contains over 9,000 manually annotated software mentions and 2,000 relationships across nearly 500 research papers from 18 scientific disciplines. It is available in TEI XML format under a CC-BY license and is useful for evaluating language models and developing software extraction tools.
The dataset is available at https://github.com/SoFairOA/Dataset.
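Since the annotations are distributed as TEI XML, a first look at a file can be as simple as counting annotated elements by type. The following is a minimal sketch, not the project's own tooling. It assumes mentions are encoded as TEI `rs` elements with a `type` attribute (for example `type="software"`), as in SoftCite-style corpora; the element and attribute names, the `count_mentions` helper, and the sample snippet are all illustrative assumptions and should be checked against the actual files in the repository.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace, written in ElementTree's "{uri}tag" form.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def count_mentions(tei_xml: str) -> Counter:
    """Count <rs> elements by their `type` attribute.

    Assumption: annotations are encoded as <rs type="...">; adjust the
    tag and attribute names if the dataset uses a different encoding.
    """
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for rs in root.iter(f"{TEI_NS}rs"):
        counts[rs.get("type", "unknown")] += 1
    return counts


# Hypothetical sample, not taken from the dataset.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>We analysed the data with <rs type="software">SPSS</rs>
       <rs type="version">25</rs> and <rs type="software">R</rs>.</p>
  </body></text>
</TEI>"""

print(count_mentions(sample))
```

To apply the same function to a downloaded file, read its contents as bytes or text and pass them to `count_mentions`; summing the resulting counters over all files gives corpus-level totals.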
Statistics
Compatibility with SoftCite dataset
Our goal was to create a manually annotated gold-standard dataset covering a wider range of domains than the existing SoftCite dataset, which is limited to Economics and Biochemistry. We follow the same annotation guidelines as SoftCite, so our dataset can complement the original data.