SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models Paper • 2506.04180 • Published Jun 4 • 32
MangaliCa Train / Eval Dataset 🐗 Collection Collection of MangaliCa's pre-training datasets • 8 items • Updated May 22 • 2
Article nanoVLM: The simplest repository to train your VLM in pure PyTorch By ariG23498 and 6 others • May 21 • 196
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models Paper • 2412.02980 • Published Dec 4, 2024 • 15
Best Practices and Lessons Learned on Synthetic Data for Language Models Paper • 2404.07503 • Published Apr 11, 2024 • 32