FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published 11 days ago • 57
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7 • 192
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 235
Towards Best Practices for Open Datasets for LLM Training Paper • 2501.08365 • Published Jan 14 • 64
SelfCodeAlign: Self-Alignment for Code Generation Paper • 2410.24198 • Published Oct 31, 2024 • 25