Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Abstract
A new framework, Text Preference Optimization (TPO), aligns text-to-image models with human preferences without requiring paired image preference data, improving text-to-image alignment and human preference scores.
Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our Open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Better Optimization For Listwise Preference in Diffusion Models (2025)
- TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement (2025)
- PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting (2025)
- Instant Preference Alignment for Text-to-Image Diffusion Models (2025)
- Iterative Prompt Refinement for Safer Text-to-Image Generation (2025)
- IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance (2025)
- ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper