MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
This repository contains the MotionStreamer model presented in the paper "MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space".
Project Page: https://zju3dv.github.io/MotionStreamer/
This paper addresses the challenge of text-conditioned streaming motion generation, which requires predicting the next-step human pose from variable-length historical motions and incoming texts. Existing methods struggle in this setting: diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation due to their discretized, non-causal tokenization. MotionStreamer addresses these issues by incorporating a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate the information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. By establishing temporal causal dependencies between current and historical motion latents, the model fully utilizes the available information for accurate online motion decoding. Experiments show that the method outperforms existing approaches while enabling additional applications, including multi-round generation, long-term generation, and dynamic motion composition.
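The core idea above — autoregressively predicting the next continuous motion latent with a diffusion head, under a causal attention pattern over the text condition and latent history — can be sketched in a few lines. The snippet below is a minimal illustrative sketch, not the repository's actual code or API: all module names, dimensions, the toy denoising loop, and the linear stand-in decoder are assumptions made for clarity.

```python
# Illustrative sketch (NOT the repository's actual API): a streaming
# autoregressive loop over continuous motion latents. A causal Transformer
# summarizes the text condition plus latent history; a small diffusion head
# denoises the next latent; each new latent is decoded to poses online.
# All names, shapes, and hyperparameters here are assumptions.
import torch
import torch.nn as nn

LATENT_DIM, HIDDEN_DIM, N_DIFFUSION_STEPS = 64, 256, 10

class CausalBackbone(nn.Module):
    """Causal Transformer over [text embedding, latent history]."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(LATENT_DIM, HIDDEN_DIM)
        layer = nn.TransformerEncoderLayer(HIDDEN_DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_emb, latent_history):
        # Prepend the text condition, then apply a lower-triangular mask so
        # each step attends only to the text and earlier latents.
        tokens = torch.cat([text_emb.unsqueeze(1), self.in_proj(latent_history)], dim=1)
        T = tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.encoder(tokens, mask=mask)[:, -1]  # condition for the next step

class DiffusionHead(nn.Module):
    """Tiny denoiser: predicts a cleaner next latent from a noisy one."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + HIDDEN_DIM + 1, HIDDEN_DIM), nn.SiLU(),
            nn.Linear(HIDDEN_DIM, LATENT_DIM),
        )

    def forward(self, noisy_latent, cond, t):
        t_feat = torch.full_like(noisy_latent[:, :1], float(t) / N_DIFFUSION_STEPS)
        return self.net(torch.cat([noisy_latent, cond, t_feat], dim=-1))

@torch.no_grad()
def stream_generate(backbone, head, decoder, text_emb, history, n_new_frames):
    """Generate n_new_frames latents one step at a time, decoding each online."""
    poses = []
    for _ in range(n_new_frames):
        cond = backbone(text_emb, history)
        z = torch.randn(text_emb.size(0), LATENT_DIM)   # start from noise
        for t in reversed(range(N_DIFFUSION_STEPS)):    # crude iterative denoising
            z = head(z, cond, t)
        history = torch.cat([history, z.unsqueeze(1)], dim=1)
        poses.append(decoder(z))                        # online causal decoding
    return poses, history

if __name__ == "__main__":
    backbone, head = CausalBackbone(), DiffusionHead()
    decoder = nn.Linear(LATENT_DIM, 22 * 3)             # stand-in: latent -> joints
    text_emb = torch.randn(1, HIDDEN_DIM)               # stand-in text embedding
    history = torch.randn(1, 4, LATENT_DIM)             # 4 latents of past motion
    poses, history = stream_generate(backbone, head, decoder, text_emb, history, 8)
    print(len(poses), poses[0].shape)                   # 8 decoded frames
```

Because each new latent depends only on the text and earlier latents, a frame can be decoded and streamed as soon as its latent is produced, which is what enables the online, variable-length generation described above.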
@article{xiao2025motionstreamer,
  title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
  author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
  journal={arXiv preprint arXiv:2503.15451},
  year={2025}
}