arxiv:2505.10238

MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

Published on May 15, 2025 · Submitted by yanboding on May 20, 2025
Abstract

Human image animation has gained increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely largely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information for open-world animation. To tackle this problem, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences (i.e., 4D motion) for human image animation. Specifically, we introduce 4DMoT (4D motion tokenizer) to quantize 3D motion sequences into 4D motion tokens. Compared to 2D-rendered pose images, 4D motion tokens offer more robust spatio-temporal cues and avoid strict pixel-level alignment between the pose image and the character, enabling more flexible and disentangled control. We then introduce MV-DiT (Motion-aware Video DiT). By designing a unique motion attention mechanism with 4D positional encodings, MV-DiT can effectively leverage motion tokens as compact yet expressive 4D context for human image animation in the complex 3D world. This marks a significant step forward in the field and opens a new direction for pose-guided human video generation. Experiments show that MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, surpassing the second-best method by 65%. Powered by robust motion tokens, MTVCrafter also generalizes well to diverse open-world characters (single/multiple, full/half-body) across various styles and scenarios. Our video demos and code are available at https://github.com/DINGYANB/MTVCrafter.
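For intuition about the tokenization step, here is a minimal sketch of VQ-style quantization of a raw 3D joint sequence into discrete motion tokens. Everything here (the `MotionTokenizer` class, the per-frame linear encoder, the dimensions) is an illustrative assumption, not the paper's actual 4DMoT architecture.

```python
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    """Toy VQ-style motion tokenizer (illustrative sketch, not the paper's 4DMoT)."""
    def __init__(self, num_joints=24, dim=256, codebook_size=8192):
        super().__init__()
        # Per-frame encoder: flatten (joints x 3) coordinates into one latent vector.
        self.encoder = nn.Linear(num_joints * 3, dim)
        self.codebook = nn.Embedding(codebook_size, dim)  # learned code vectors
        self.decoder = nn.Linear(dim, num_joints * 3)

    def forward(self, motion):  # motion: (B, T, J, 3) raw 3D joint sequence
        B, T, J, _ = motion.shape
        z = self.encoder(motion.reshape(B, T, J * 3))         # (B, T, dim)
        # Nearest-codebook lookup -> discrete motion token ids.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (B, T, K)
        ids = dists.argmin(dim=-1)                            # (B, T)
        z_q = self.codebook(ids)                              # quantized latents
        z_q = z + (z_q - z).detach()  # straight-through estimator for training
        recon = self.decoder(z_q).reshape(B, T, J, 3)
        return ids, z_q, recon

# Usage: tokenize a 49-frame sequence of 24 joints (SMPL-like layout assumed).
tokenizer = MotionTokenizer()
ids, z_q, recon = tokenizer(torch.randn(1, 49, 24, 3))
print(ids.shape, z_q.shape)  # torch.Size([1, 49]) torch.Size([1, 49, 256])
```

The discrete ids are what downstream components consume: unlike rendered pose images, they carry 3D structure without being tied to pixel positions.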

Community

Paper author · Paper submitter

[Screenshot attached: 2025-05-20]

Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.
To tackle these problems, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences for open-world human image animation, moving beyond intermediate 2D representations.

We introduce 4DMoT (a 4D motion tokenizer) to encode raw motion data into discrete motion tokens, preserving compact yet expressive 4D spatio-temporal information.
We then propose MV-DiT (Motion-aware Video DiT), which integrates a motion attention module with 4D positional encodings to effectively modulate vision tokens with motion tokens (a minimal sketch follows below).
The overall pipeline enables high-quality human video generation guided by 4D motion tokens.
MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, outperforming the second-best method by approximately 65%. It also generalizes well to diverse characters (single/multiple, full/half-body) across various styles.
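To make "motion attention with 4D positional encodings" concrete, below is a minimal sketch in which DiT vision tokens cross-attend to motion tokens whose (t, x, y, z) positions are injected via sinusoidal encodings. The encoding scheme and all names here are assumptions for illustration, not the paper's exact design.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(coords, dim_per_axis):
    # coords: (..., 4) normalized (t, x, y, z); returns (..., 4 * dim_per_axis).
    freqs = torch.exp(torch.arange(0, dim_per_axis, 2, dtype=torch.float32)
                      * (-math.log(10000.0) / dim_per_axis))
    angles = coords[..., None] * freqs                     # (..., 4, dim/2)
    pe = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., 4, dim)
    return pe.flatten(-2)                                  # (..., 4 * dim)

class MotionAttention(nn.Module):
    """Toy motion-attention block (illustrative sketch, not the paper's MV-DiT)."""
    def __init__(self, dim=1024, heads=16, pe_dim_per_axis=64):
        super().__init__()
        self.pe_dim_per_axis = pe_dim_per_axis
        self.pe_proj = nn.Linear(4 * pe_dim_per_axis, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens, motion_tokens, motion_coords):
        # vision_tokens: (B, N, dim) DiT video latents
        # motion_tokens: (B, M, dim) quantized 4D motion tokens
        # motion_coords: (B, M, 4) each token's (t, x, y, z) position
        pe = self.pe_proj(sinusoidal_pe(motion_coords, self.pe_dim_per_axis))
        kv = motion_tokens + pe           # inject 4D position into keys/values
        out, _ = self.attn(vision_tokens, kv, kv)
        return vision_tokens + out        # residual connection, as in DiT blocks

# Usage: 2048 vision tokens attend to 49 motion tokens.
block = MotionAttention()
out = block(torch.randn(1, 2048, 1024), torch.randn(1, 49, 1024),
            torch.rand(1, 49, 4))
print(out.shape)  # torch.Size([1, 2048, 1024])
```

In a real DiT block this would sit alongside self-attention and feed-forward layers; the point is only that motion conditioning arrives as tokens with explicit 4D positions rather than as rendered pose frames.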

Our code is available at:
https://github.com/DINGYANB/MTVCrafter



Models citing this paper: 1
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0