Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 66
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 27
Article Releasing the largest multilingual open pretraining dataset By Pclanglais • 4 days ago • 88
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists Paper • 2410.23331 • Published 18 days ago • 7
SmolLM2 Collection State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M • 8 items • Updated 13 days ago • 167
Granite 3.0 Language Models Collection A series of language models trained by IBM and released under the Apache 2.0 license. We release both the base pretrained and instruct models. • 8 items • Updated 13 days ago • 87
Article Releasing Outlines-core 0.1.0: structured generation in Rust and Python 27 days ago • 41
Article ColFlor: Towards BERT-Size Vision-Language Document Retrieval Models By ahmed-masry • about 1 month ago • 15
Article OCR Processing and Text in Image Analysis with DeepSeek Janus-1.3B By PandorAI1995 • 26 days ago • 2
Article OCR Processing and Text in Image Analysis with Florence-2-base and Qwen2-VL-2B By PandorAI1995 • 30 days ago • 13
Article 🇮🇹🇯🇵🇧🇷 Generating multilingual instruction datasets with Magpie 🐦⬛ By anakin87 • 27 days ago • 18
Article How to build a custom text classifier without days of human labeling By sdiazlor • Oct 17 • 55