DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Abstract
A Diffusion Transformer-based framework generates high-fidelity human-product demonstration videos by preserving identities and spatial relationships, using masked cross-attention and structured text encoding.
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text encoding is used to incorporate category-level semantics, enhancing 3D consistency during small rotational changes across frames. Trained on a hybrid dataset with extensive data augmentation strategies, our approach outperforms state-of-the-art techniques in maintaining the identity integrity of both humans and products and generating realistic demonstration motions. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
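To make the reference-injection idea more concrete, the sketch below shows one way a masked cross-attention block could be wired up. This is an illustrative reading, not the paper's implementation: the token layout, the concatenation of human and product reference tokens, and the binary attention mask (e.g. derived from product bounding boxes and a human segmentation) are assumptions introduced here for the example.

```python
# Minimal sketch of a masked cross-attention block (illustrative only; not the
# authors' code). Assumed setup: latent video tokens attend to a concatenation
# of human and product reference tokens, with a per-token boolean mask that
# restricts product-region tokens to product references and human-region
# tokens to human references.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, ref, attn_mask):
        # x:         (B, N, D) latent video tokens
        # ref:       (B, M, D) concatenated human + product reference tokens
        # attn_mask: (B, N, M) bool, True where a latent token may attend to a
        #            reference token; every row should keep at least one True
        #            entry to avoid fully masked attention rows.
        B, N, D = x.shape
        q = self.to_q(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.to_kv(ref).chunk(2, dim=-1)
        k = k.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        # Broadcast the (B, N, M) mask over attention heads.
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask[:, None])
        out = out.transpose(1, 2).reshape(B, N, D)
        # Residual injection of the selected reference features into the DiT stream.
        return x + self.proj(out)
```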
Community
We present DreamActor-H1, a novel Diffusion Transformer (DiT)-based framework that generates high-quality human-product demonstration videos from paired human and product images. Trained on a large-scale hybrid dataset with multi-class augmentation, DreamActor-H1 outperforms state-of-the-art methods in preserving human-product identity integrity and generating physically plausible demonstration motions, making it suitable for personalized e-commerce advertising and interactive media. Project page: https://submit2025-dream.github.io/DreamActor-H1/.
The following papers were recommended by the Semantic Scholar API
- DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation (2025)
- PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement (2025)
- Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation (2025)
- LatentMove: Towards Complex Human Movement Video Generation (2025)
- Subject-driven Video Generation via Disentangled Identity and Motion (2025)
- MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on (2025)
- A Unit Enhancement and Guidance Framework for Audio-Driven Avatar Video Generation (2025)