rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking • Paper 2501.04519 • Published 10 days ago • 230 upvotes
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • Paper 2501.03895 • Published 11 days ago • 48 upvotes
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction • Paper 2501.01957 • Published 15 days ago • 40 upvotes
ProgCo: Program Helps Self-Correction of Large Language Models • Paper 2501.01264 • Published 16 days ago • 25 upvotes
Cosmos World Foundation Model Platform for Physical AI • Paper 2501.03575 • Published 11 days ago • 63 upvotes
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? • Paper 2501.05510 • Published 9 days ago • 35 upvotes
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM • Paper 2501.00599 • Published 18 days ago • 41 upvotes
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints • Paper 2501.03841 • Published 11 days ago • 49 upvotes
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control • Paper 2501.01427 • Published 16 days ago • 49 upvotes
VideoRAG: Retrieval-Augmented Generation over Video Corpus • Paper 2501.05874 • Published 8 days ago • 61 upvotes
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation • Paper 2501.01895 • Published 15 days ago • 48 upvotes
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining • Paper 2501.00958 • Published 17 days ago • 95 upvotes
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis • Paper 2412.19723 • Published 22 days ago • 79 upvotes
PixMo Collection • A set of vision-language datasets built by Ai2 and used to train the Molmo family of models (read more at https://molmo.allenai.org/blog) • 9 items • Updated 12 days ago • 53 upvotes