Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces Paper • 2506.00123 • Published 15 days ago
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization Paper • 2503.19901 • Published Mar 25
MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space Paper • 2503.15451 • Published Mar 19
Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation Paper • 2503.13424 • Published Mar 17
Article π0 and π0-FAST: Vision-Language-Action Models for General Robot Control By danaaubakirova and 3 others • Feb 4
Article MotionLCM-V2: Improved Compression Rate for Multi-Latent-Token Diffusion By wxDai • Dec 11, 2024
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation Paper • 2412.07589 • Published Dec 10, 2024