Subject-driven Video Generation via Disentangled Identity and Motion
Abstract
We propose to train a subject-driven, customized video generation model by decoupling subject-specific learning from temporal dynamics, enabling zero-shot generation without additional tuning. Existing tuning-free video customization methods typically rely on large annotated video datasets, which are computationally expensive to collect and label. In contrast, we train the video customization model directly on an image customization dataset, factorizing video customization into two components: (1) identity injection via the image customization dataset and (2) temporal modeling preservation via image-to-video training on a small set of unannotated videos. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings and demonstrating the effectiveness of our framework.
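Below is a minimal, illustrative sketch (not the authors' released code) of the training signals the abstract describes: stochastic switching between an identity-injection objective on image-customization data and a temporal-preservation objective via image-to-video fine-tuning on unannotated clips, with random image token dropping and a randomized image initialization in the image-to-video branch. The `model.denoising_loss` interface, the batch keys, and all probabilities are assumptions for illustration only.

```python
import torch


def drop_image_tokens(tokens: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    """Randomly mask conditioning-image tokens to discourage copy-and-paste.

    tokens: (batch, num_tokens, dim) tokens derived from the conditioning image.
    drop_prob is an assumed value, not the paper's setting.
    """
    keep = (torch.rand(tokens.shape[:2], device=tokens.device) > drop_prob).float()
    return tokens * keep.unsqueeze(-1)


def training_step(model, image_batch, video_batch, switch_prob: float = 0.5):
    """One joint-optimization step with stochastic switching between objectives."""
    if torch.rand(()).item() < switch_prob:
        # Identity injection: supervise on an image-customization pair
        # (reference-image tokens + caption -> customized target image).
        loss = model.denoising_loss(
            target=image_batch["target_image"],
            cond_tokens=image_batch["ref_tokens"],
            text=image_batch["caption"],
        )
    else:
        # Temporal preservation: image-to-video fine-tuning on unannotated clips.
        # Token dropping plus a noised ("randomized") conditioning image keep the
        # model from simply copying the reference frame into every output frame.
        cond_tokens = drop_image_tokens(video_batch["first_frame_tokens"])
        cond_tokens = cond_tokens + 0.1 * torch.randn_like(cond_tokens)
        loss = model.denoising_loss(
            target=video_batch["frames"],
            cond_tokens=cond_tokens,
            text=None,
        )
    return loss
```

Under this reading, switching per step (rather than summing both losses) keeps gradients from either objective from dominating, which is one plausible way the stochastic switching mitigates catastrophic forgetting.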
Community
We present Subject-to-Video, a tuning-free framework that turns a single reference image into identity-faithful, motion-smooth videos, trained without any custom video dataset!
We disentangle identity ✕ motion and outperform prior personalized T2V models in zero-shot scenarios.
Paper : https://arxiv.org/html/2504.17816v1
Code : https://github.com/carpedkm/disentangled-subject-to-vid
Project page : https://carpedkm.github.io/projects/disentangled_sub/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance (2025)
- VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models (2025)
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models (2025)
- FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation (2025)
- JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation (2025)
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models (2025)
- Personalize Anything for Free with Diffusion Transformer (2025)