LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Paper • 2505.02625 • Published 7 days ago • 20
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models Paper • 2505.02735 • Published 7 days ago • 27
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning Paper • 2505.02835 • Published 6 days ago • 22
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Paper • 2505.02707 • Published 7 days ago • 79
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Paper • 2505.03739 • Published 5 days ago • 8
Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models Paper • 2505.03821 • Published 9 days ago • 22
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant Paper • 2505.05467 • Published 3 days ago • 13
SkillMimic-V2: Learning Robust and Generalizable Interaction Skills from Sparse and Noisy Demonstrations Paper • 2505.02094 • Published 8 days ago • 16
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers Paper • 2503.11579 • Published Mar 14 • 20
Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning Paper • 2503.11646 • Published Mar 14 • 36
MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation Paper • 2503.14428 • Published Mar 18 • 9
LLaVA-o1: Let Vision Language Models Reason Step-by-Step Paper • 2411.10440 • Published Nov 15, 2024 • 125
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data Paper • 2410.18558 • Published Oct 24, 2024 • 20