MultiRef: Controllable Image Generation with Multiple Visual References
Abstract
Experiments with multiple image-text models and agentic frameworks show that even state-of-the-art systems struggle with generating images from multiple visual references, highlighting the need for more flexible creative tools.
Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine RefBlend, covering 10 reference types and 33 reference combinations. Based on RefBlend, we further construct MultiRef, a dataset of 38k high-quality images, to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning: the best model, OmniGen, achieves only 66.6% on synthetic samples and 79.0% on real-world cases on average relative to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.
Community
New preprint: MultiRef enables controllable image generation using MULTIPLE visual references!
Gone are the days of single-reference limitations -- now you can blend and control multiple visual inputs for precise image synthesis.
Accepted to ACM MM 2025!
Dataset: https://huggingface.co/datasets/ONE-Lab/MultiRef-dataset
Benchmark: https://huggingface.co/datasets/ONE-Lab/MultiRef-benchmark
Project Homepage: https://multiref.github.io/
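Both Hugging Face repositories can be pulled with the standard `datasets` library. The sketch below is a minimal loading example; it assumes the repos expose a default config, and the split and field names are assumptions -- check the dataset cards for the actual layout.

```python
# Minimal sketch: loading the MultiRef benchmark and dataset from Hugging Face.
# Split and column names below are assumptions; consult the dataset cards.
from datasets import load_dataset

benchmark = load_dataset("ONE-Lab/MultiRef-benchmark")  # 990 synthetic + 1,000 real-world samples
dataset = load_dataset("ONE-Lab/MultiRef-dataset")      # ~38k high-quality images

print(benchmark)  # inspect available splits and features
print(dataset)

# Peek at a few benchmark entries (field names depend on the repo schema)
first_split = list(benchmark.keys())[0]
for sample in benchmark[first_split].select(range(3)):
    print(sample.keys())
```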
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design (2025)
- ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation (2025)
- Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation (2025)
- MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models (2025)
- OmniGen2: Exploration to Advanced Multimodal Generation (2025)
- IC-Custom: Diverse Image Customization via In-Context Learning (2025)
- VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis (2025)