PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement
Abstract
PolyVivid is a multi-subject video customization framework that uses text-image fusion, 3D-RoPE enhancement, attention-inherited identity injection, and MLLM-based data processing to ensure identity consistency and realistic video generation.
Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
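The paper page does not include code, so the snippet below is only a rough illustration of what a 3D-RoPE-style position encoding over fused text/identity tokens and video-latent tokens might look like: the channel dimension is split into temporal/height/width groups and standard rotary embeddings are applied per group. The helper names (`rope_1d`, `rope_3d`), the coordinate assignment for identity tokens, and the toy shapes are hypothetical assumptions, not PolyVivid's actual implementation.

```python
# Hypothetical 3D-RoPE sketch (PolyVivid's real module is not published here).
import torch

def rope_1d(x, pos, base=10000.0):
    """Apply standard 1D rotary embedding to x given integer positions pos."""
    # x: (..., seq, dim) with dim even; pos: (seq,)
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t_idx, h_idx, w_idx):
    """Split channels into (t, h, w) groups and rotate each group separately."""
    d = x.shape[-1] // 3
    return torch.cat([
        rope_1d(x[..., :d],      t_idx),
        rope_1d(x[..., d:2 * d], h_idx),
        rope_1d(x[..., 2 * d:],  w_idx),
    ], dim=-1)

# Toy usage: 4 fused text/identity tokens followed by a 2x2x2 video-latent grid.
seq, dim = 4 + 8, 96
x = torch.randn(1, seq, dim)
# Assumed positioning scheme: identity tokens get dedicated coordinates so they
# stay distinguishable from (and addressable by) spatio-temporal video tokens.
t_idx = torch.cat([torch.zeros(4, dtype=torch.long), torch.arange(8) // 4])
h_idx = torch.cat([torch.arange(4), (torch.arange(8) % 4) // 2])
w_idx = torch.cat([torch.arange(4), torch.arange(8) % 2])
y = rope_3d(x, t_idx, h_idx, w_idx)
print(y.shape)  # torch.Size([1, 12, 96])
```

The per-axis channel split is one common way to extend rotary embeddings to spatio-temporal grids; how PolyVivid actually assigns coordinates to the fused text-image embeddings is described only at a high level in the abstract.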
Community
The following papers were recommended by the Semantic Scholar API
- HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation (2025)
- MAGREF: Masked Guidance for Any-Reference Video Generation (2025)
- OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation (2025)
- Subject-driven Video Generation via Disentangled Identity and Motion (2025)
- AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment (2025)
- BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation (2025)
- FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation (2025)