The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models
Paper • 2310.20440 • Published Oct 31, 2023