SmolLM3 pretraining datasets Collection datasets used in SmolLM3 pretraining • 15 items • Updated Aug 12 • 35
🇩🇪German SFT and DPO datasets Collection Datasets that can be used for LLM training with axolotl, trl or llama_factory. • 33 items • Updated Jan 23 • 12
Common Pile v0.1 Raw Data Collection 8TB of public domain and openly licensed text • 30 items • Updated Aug 14 • 21
Common Pile v0.1 Filtered Data Collection An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6 • 20
Tulu 3 Datasets Collection All datasets released with Tulu 3 -- state of the art open post-training recipes. • 33 items • Updated Sep 18 • 94