DreamO: A Unified Framework for Image Customization
Abstract
Recently, extensive research on image customization (e.g., identity, subject, style, background) has demonstrated the strong customization capabilities of large-scale generative models. However, most approaches are designed for specific tasks, which restricts their ability to generalize across and combine different types of conditions. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating the seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process inputs of different types. During training, we construct a large-scale dataset that covers various customization tasks, and we introduce a feature routing constraint to facilitate precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over where conditions appear in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality-alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.
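The abstract describes the feature routing constraint only at a high level. As a rough, hypothetical illustration (not the paper's actual implementation), the sketch below shows one way such a constraint could be expressed: a penalty on the attention mass that generated-image tokens belonging to one conditioned region spend on reference tokens of other conditions. All names here (`routing_loss`, `attn`, `region_masks`, `ref_ids`) are assumptions made for illustration.

```python
import torch

def routing_loss(attn, region_masks, ref_ids, eps=1e-6):
    """Hypothetical sketch of a feature-routing penalty.

    attn:         (B, Q, R) attention from generated-image tokens (Q) to
                  reference-image tokens (R), averaged over heads/layers.
    region_masks: (B, Q, K) binary masks marking which generated tokens
                  belong to each of the K conditioned regions.
    ref_ids:      (B, R) long tensor giving, for every reference token,
                  the index in [0, K) of the condition it comes from.

    The penalty is the attention mass that region-k tokens spend on
    reference tokens of any *other* condition, so minimizing it pushes
    each region to query only its own reference image.
    """
    B, Q, R = attn.shape
    K = region_masks.shape[-1]
    loss = attn.new_zeros(B)
    for k in range(K):
        q_mask = region_masks[..., k].float()            # (B, Q)
        wrong_ref = (ref_ids != k).float()               # (B, R)
        # Attention mass flowing from region-k tokens to "wrong" references.
        mass = torch.einsum('bq,bqr,br->b', q_mask, attn, wrong_ref)
        loss = loss + mass / (q_mask.sum(dim=-1) + eps)  # normalize per region
    return loss.mean()


# Toy usage: 2 conditions, 16 generated tokens, 8 reference tokens.
attn = torch.softmax(torch.randn(1, 16, 8), dim=-1)
region_masks = torch.zeros(1, 16, 2)
region_masks[:, :8, 0] = 1.0   # first half of the image uses condition 0
region_masks[:, 8:, 1] = 1.0   # second half uses condition 1
ref_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
print(routing_loss(attn, region_masks, ref_ids))
```

In DreamO such a constraint would be applied during training alongside the usual diffusion objective; the exact attention layers, region definitions, and loss weighting are details the abstract does not specify.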
Community
We propose DreamO, a unified image customization framework that covers ID, IP, try-on, and style tasks. DreamO maintains strong character fidelity and avoids confusion between multiple subjects. The model will be open-sourced at https://github.com/bytedance/DreamO (within 1 week); please stay tuned.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing (2025)
- RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models (2025)
- InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework (2025)
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning (2025)
- UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer (2025)
- Less-to-More Generalization: Unlocking More Controllability by In-Context Generation (2025)
- SkyReels-A2: Compose Anything in Video Diffusion Transformers (2025)