MONKEY: Masking ON KEY-Value Activation Adapter for Personalization
Abstract
Using an automatically generated mask to restrict image tokens during inference improves prompt and source image alignment in personalized diffusion models.
Personalizing diffusion models lets users generate new images that incorporate a given subject, offering more control than a text prompt alone. These models often fall short when they simply recreate the subject image and ignore the text prompt. We observe that one popular personalization method, IP-Adapter, automatically generates masks that definitively segment the subject from the background during inference. We propose using this automatically generated mask on a second pass to mask the image tokens, restricting them to the subject rather than the background and allowing the text prompt to attend to the rest of the image. For text prompts describing locations and places, this produces images that accurately depict the subject while definitively matching the prompt. We compare our method to a few other test-time personalization methods and find that it displays high prompt and source image alignment.
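The two-pass idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: it assumes an IP-Adapter-style decoupled cross-attention in which a text branch and an image branch are summed, and it assumes the mask is obtained by thresholding a per-location saliency score from the first pass. The function names (`decoupled_attention`, `threshold_mask`), the saliency definition, and the median threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for one head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_attention(q, k_txt, v_txt, k_img, v_img, mask=None, scale=1.0):
    # Text and image branches are computed separately and summed.
    text_out = attention(q, k_txt, v_txt)
    img_out = attention(q, k_img, v_img)
    if mask is not None:
        # Assumption: the mask zeroes the image-token contribution at
        # background locations, leaving those locations to the text prompt.
        img_out = img_out * mask[:, None]
    return text_out + scale * img_out

def threshold_mask(saliency, quantile=0.5):
    # Hypothetical rule: locations above the chosen quantile count as subject.
    return (saliency > np.quantile(saliency, quantile)).astype(np.float64)

# Toy sizes: 16 spatial locations, 4 text tokens, 4 image tokens, width 8.
rng = np.random.default_rng(0)
n_pix, n_txt, n_img, d = 16, 4, 4, 8
q = rng.normal(size=(n_pix, d))
k_txt, v_txt = rng.normal(size=(n_txt, d)), rng.normal(size=(n_txt, d))
k_img, v_img = rng.normal(size=(n_img, d)), rng.normal(size=(n_img, d))

# Pass 1 (assumed): score each location by its mean raw affinity to the
# image tokens, then threshold that score into a binary subject mask.
saliency = (q @ k_img.T).mean(axis=1)
mask = threshold_mask(saliency)

# Pass 2: rerun attention with the image tokens restricted to the mask.
out = decoupled_attention(q, k_txt, v_txt, k_img, v_img, mask=mask)
```

In this sketch, locations where the mask is zero receive only the text branch, which mirrors the abstract's description of letting the text prompt attend to the rest of the image. In a real diffusion pipeline the mask would have to be resized to each attention layer's spatial resolution, a detail omitted here.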
Community
This paper is a first draft at an attempt to refine IP-Adapter for better background-subject disentanglement. Feedback is always welcome!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Griffin: Generative Reference and Layout Guided Image Composition (2025)
- ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation (2025)
- MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation (2025)
- SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet (2025)
- EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model (2025)
- SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress (2025)
- ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement (2025)