JARVIS-VLA-v1 Collection • Vision-Language-Action Models in Minecraft • 4 items • Updated 4 days ago • 9
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published 10 days ago • 29
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization Paper • 2503.12937 • Published 9 days ago • 26
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research Paper • 2503.13399 • Published 9 days ago • 20
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published 12 days ago • 76
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Paper • 2503.10582 • Published 13 days ago • 20
Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models Paper • 2503.09669 • Published 14 days ago • 34
CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing Paper • 2503.10613 • Published 13 days ago • 73
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary Paper • 2503.09402 • Published 14 days ago • 6
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Paper • 2503.04724 • Published 20 days ago • 66
A Deepdive into Aya Vision: Advancing the Frontier of Multilingual Multimodality Article • Published 22 days ago • 69
C4AI Aya Vision Collection • Aya Vision is a state-of-the-art family of vision models that brings multimodal capabilities to 23 languages. • 5 items • Updated 22 days ago • 68
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models Paper • 2502.16033 • Published Feb 22 • 17