SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7 • 202
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 89
Improving Vision-Language-Action Model with Online Reinforcement Learning Paper • 2501.16664 • Published Jan 28 • 1
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse Paper • 2503.16365 • Published Mar 20 • 40
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published Mar 16 • 35
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation Paper • 2501.16764 • Published Jan 28 • 22
VideoRAG: Retrieval-Augmented Generation over Video Corpus Paper • 2501.05874 • Published Jan 10 • 75
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published Jan 10 • 65
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Paper • 2501.04519 • Published Jan 8 • 286
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images Paper • 2501.04689 • Published Jan 8 • 17
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization Paper • 2501.03271 • Published Jan 5 • 10
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published Dec 31, 2024 • 46
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation Paper • 2412.18176 • Published Dec 24, 2024 • 16
Health AI Developer Foundations (HAI-DEF) Collection Groups models released for use in health AI by Google. Read more about HAI-DEF at http://goo.gle/hai-def • 16 items • Updated 9 days ago • 138
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration Paper • 2412.04440 • Published Dec 5, 2024 • 22
ARCLE: The Abstraction and Reasoning Corpus Learning Environment for Reinforcement Learning Paper • 2407.20806 • Published Jul 30, 2024 • 1
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark Paper • 2410.19168 • Published Oct 24, 2024 • 23
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Paper • 2401.12168 • Published Jan 22, 2024 • 29