SkyLadder: Better and Faster Pretraining via Context Window Scheduling Paper • 2503.15450 • Published 5 days ago • 11
InsectSet459: an open dataset of insect sounds for bioacoustic machine learning Paper • 2503.15074 • Published 6 days ago • 1
Brazilian legal datasets ⚖️ Collection A collection of data extracted from the courts of Brazil (and other legal websites) • 31 items • Updated 5 days ago • 2
Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru Paper • 2503.07587 • Published 14 days ago • 10
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published 14 days ago • 95
JurisTCU: A Brazilian Portuguese Information Retrieval Dataset with Query Relevance Judgments Paper • 2503.08379 • Published 13 days ago • 2
EuroBERT: Scaling Multilingual Encoders for European Languages Paper • 2503.05500 • Published 17 days ago • 75
Article HuggingFace, IISc partner to supercharge model building on India's diverse languages • 26 days ago • 17
rank1 Collection rank1 is the first test-time compute reasoning model in IR • 15 items • Updated 25 days ago • 3
OWLS: Scaling Laws for Speech Recognition and Translation Collection 🦉 A suite of Whisper-style models from 250M to 18B parameters. Trained on up to 360K hours of data. 16k sampling rate. • 7 items • Updated 15 days ago • 4
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models Paper • 2502.15964 • Published Feb 21 • 1
"Actionable Help" in Crises: A Novel Dataset and Resource-Efficient Models for Identifying Request and Offer Social Media Posts Paper • 2502.16839 • Published 29 days ago • 1
Slam Collection All resources for SpeechLMs from "Slamming: Training a Speech Language Model on One GPU in a Day". We provide tokeniser, lm, and datasets • 6 items • Updated 28 days ago • 13
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models Paper • 2502.17387 • Published 28 days ago • 5