Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Abstract
Temperature-Adjusted Cross-modal Attention (TACA) enhances text-image alignment in diffusion models by dynamically rebalancing multimodal interactions through temperature scaling and timestep-dependent adjustment.
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT: 1) the suppression of cross-modal attention due to token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting, both of which hinder alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We evaluate TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
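To make the idea concrete, below is a minimal PyTorch sketch of what temperature-adjusted cross-modal attention could look like, based only on the abstract. In MM-DiT, text and image tokens are concatenated and processed by joint attention, so a token imbalance (many image tokens, few text tokens) can dilute attention to the text. The function name `taca_attention`, the scaling factor `gamma`, the timestep cutoff `t_cutoff`, and the [text, image] token ordering are all illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of temperature-adjusted cross-modal attention (TACA-style).
# Assumptions: keys/values are ordered [text tokens, image tokens], and the
# timestep is normalized to [0, 1] with 1 = pure noise. `gamma` and `t_cutoff`
# are hypothetical hyperparameters chosen for illustration.
import torch

def taca_attention(q, k, v, num_text_tokens, timestep, gamma=1.5, t_cutoff=0.5):
    """Joint attention over concatenated [text, image] tokens.

    Logits attending to text keys are amplified by a temperature factor
    during the early, noisy denoising steps (timestep > t_cutoff), which
    rebalances cross-modal interaction against the much larger image-token
    population. Later steps fall back to standard attention.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5        # (..., L_q, L_k)
    if timestep > t_cutoff:                          # timestep-dependent adjustment
        logits[..., :num_text_tokens] = logits[..., :num_text_tokens] * gamma
    attn = logits.softmax(dim=-1)
    return attn @ v

# Toy usage: 8 text tokens vs. 256 image tokens, early denoising step.
q = k = v = torch.randn(1, 8 + 256, 64)
out = taca_attention(q, k, v, num_text_tokens=8, timestep=0.9)
```

Multiplying only the text-key logits by `gamma > 1` before the softmax is equivalent to lowering the attention temperature for the cross-modal portion, so text tokens receive a larger share of attention mass without adding any parameters; gating this on the timestep reflects the paper's observation that attention weighting should be timestep-aware.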
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers (2025)
- AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment (2025)
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation (2025)
- RelationAdapter: Learning and Transferring Visual Relation with Diffusion Transformers (2025)
- VSC: Visual Search Compositional Text-to-Image Diffusion Model (2025)
- DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation (2025)
- Personalized Text-to-Image Generation with Auto-Regressive Models (2025)