Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs Paper • 2506.19290 • Published Jun 24 • 51
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Paper • 2507.04009 • Published Jul 5 • 41
RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs Paper • 2507.03253 • Published Jul 4 • 18
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Paper • 2507.16746 • Published Jul 22 • 34
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining Paper • 2508.10975 • Published 25 days ago • 57
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training Paper • 2508.17677 • Published 15 days ago • 14