Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Abstract
ColorCtrl, a training-free method using Multi-Modal Diffusion Transformers, achieves precise and consistent color editing in images and videos with word-level control and superior performance compared to existing approaches.
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach shows even greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
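The abstract attributes the structure/color disentanglement to targeted manipulation of attention maps and value tokens inside MM-DiT attention, but does not spell out the procedure. The snippet below is only a minimal sketch of the general idea of attention-map injection between a reconstruction pass and an edit pass; the function and variable names (e.g. `joint_attention_with_injection`) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def joint_attention_with_injection(q_edit, k_edit, v_edit,
                                   q_src, k_src, v_src,
                                   inject_structure=True):
    """One attention step shared by two parallel passes (sketch only).

    A source (reconstruction) pass and an edit pass are run in parallel.
    Reusing the source pass's attention map in the edit pass keeps the
    spatial structure fixed, while the edit pass's value tokens carry the
    new color/appearance. This is an illustration of the general
    attention-injection idea, not the paper's exact algorithm.
    """
    scale = q_edit.shape[-1] ** -0.5

    # Attention map of the source pass (encodes layout / structure).
    attn_src = F.softmax(q_src @ k_src.transpose(-2, -1) * scale, dim=-1)
    # Attention map the edit pass would compute on its own.
    attn_edit = F.softmax(q_edit @ k_edit.transpose(-2, -1) * scale, dim=-1)

    # Structure from the source map, color from the edited value tokens.
    attn = attn_src if inject_structure else attn_edit
    return attn @ v_edit

# Toy usage: batch 1, 8 heads, 16 tokens, head dim 64.
shape = (1, 8, 16, 64)
q_e, k_e, v_e = (torch.randn(shape) for _ in range(3))
q_s, k_s, v_s = (torch.randn(shape) for _ in range(3))
out = joint_attention_with_injection(q_e, k_e, v_e, q_s, k_s, v_s)
```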
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- S2Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control (2025)
- Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing (2025)
- LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing (2025)
- ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation (2025)
- CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing (2025)
- CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing (2025)
- Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control (2025)