Qwen2.5-Omni Collection End-to-End Omni (text, audio, image, video, and natural speech interaction) model based on Qwen2.5 • 3 items • Updated 3 days ago • 68
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse Paper • 2503.16365 • Published 10 days ago • 35
JARVIS-VLA-v1 Collection Vision-Language-Action Models in Minecraft. • 4 items • Updated 8 days ago • 9