Dataset

Resource Type	Link
📄 Documentation	Starting environment for human annotation of software mentions
💾 Dataset	https://github.com/SoFairOA/Dataset
📜 Paper	Paper

This multidisciplinary dataset, developed by the SoFAIR project with CLARIN-PL, contains over 9,000 manually annotated software mentions and 2,000 relationships across nearly 500 research papers from 18 scientific disciplines. It is available in TEI XML format under a CC-BY license and is useful for evaluating language models and developing software extraction tools.
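Since the corpus is distributed as TEI XML, a first sanity check is often to count the annotations in a file. The following is a minimal sketch, not the project's own tooling. It assumes that mentions are encoded as `<rs type="...">` elements in the TEI namespace, as in the SoftCite corpus; check the actual files in the repository before relying on this. The sample document, the software names in it, and the function names `count_annotations` and `software_names` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace, in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample document; the real files may be structured differently.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>We analysed the data with <rs type="software">ExampleStats</rs>
      version <rs type="version">1.2</rs> and plotted the results using
      <rs type="software">ExamplePlot</rs>.</p>
    </body>
  </text>
</TEI>"""


def count_annotations(tei_xml):
    """Count <rs> annotations by their `type` attribute (assumed encoding)."""
    root = ET.fromstring(tei_xml)
    return Counter(rs.get("type", "unknown") for rs in root.iter(TEI_NS + "rs"))


def software_names(tei_xml):
    """Return the text of every <rs type="software"> element, whitespace-normalised."""
    root = ET.fromstring(tei_xml)
    return [
        " ".join("".join(rs.itertext()).split())
        for rs in root.iter(TEI_NS + "rs")
        if rs.get("type") == "software"
    ]


if __name__ == "__main__":
    print(count_annotations(SAMPLE_TEI))
    print(software_names(SAMPLE_TEI))
```

To process a file from the dataset, read its contents into a string and pass it to either function. If the files use different element or attribute names, only the `rs` tag and the `type` check need to change.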

Statistics

[Figure: SoFAIR software mentions per discipline]

Compatibility with SoftCite dataset

Our goal was to create a manually annotated gold-standard dataset that covers more domains than the existing SoftCite dataset, which is limited to Economics and Biochemistry. We follow the same annotation guidelines as SoftCite, so the two datasets are compatible and SoFAIR can be used to complement the original data.