Introduction
We introduce two learnable tokens into the prompt to generate complementary cross-attention maps. These tokens are designed to capture the object concept and the background concept, where the background covers all regions outside the objects. We adopt a two-stage pipeline. Throughout both stages, the prompt is set to "an aerial view image with [V1] [category] in [V2] [S]" for source-domain data and "an aerial view image with [V1] [category] in [V3] [T]" for target-domain data. [V1] is the learnable token for the object concept, while [V2] and [V3] are the learnable tokens for the source-domain and target-domain background concepts, respectively. In the first stage, these tokens are learned; in the second stage, we fix the learned tokens [V1], [V2], and [V3] and further fine-tune the U-Net.
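The sketch below illustrates how this token setup could look with Hugging Face diffusers/transformers: the three placeholder tokens are registered with the CLIP tokenizer, the text-encoder embedding table is grown to hold them, and the source/target prompt templates are built around them. The base checkpoint, the "cars" category, and the [S]/[T] domain words are illustrative assumptions, not the authors' exact training code.

```python
# Minimal sketch of the two-stage token setup (assumptions: base checkpoint,
# category word, and domain names are placeholders).
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder

# Register the three learnable placeholder tokens and grow the embedding table.
new_tokens = ["[V1]", "[V2]", "[V3]"]
tokenizer.add_tokens(new_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))
new_ids = tokenizer.convert_tokens_to_ids(new_tokens)

# Prompt templates for source- and target-domain images.
category, source_name, target_name = "cars", "[S]", "[T]"  # hypothetical values
src_prompt = f"an aerial view image with [V1] {category} in [V2] {source_name}"
tgt_prompt = f"an aerial view image with [V1] {category} in [V3] {target_name}"

# Stage 1: optimize only the new token embeddings (gradients restricted to new_ids
# in practice), keeping the rest of the model frozen.
stage1_params = [text_encoder.get_input_embeddings().weight]
# Stage 2: freeze the learned [V1]/[V2]/[V3] embeddings and fine-tune the U-Net.
stage2_params = pipe.unet.parameters()
```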
Model Usage
This repository contains the Stable Diffusion model fine-tuned in the second stage together with the learned token embeddings.
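A minimal usage sketch is given below, assuming the fine-tuned pipeline can be loaded directly from this repository with diffusers; the exact file layout of the embedding files and the prompt contents are assumptions.

```python
# Minimal usage sketch (assumptions: the repo loads as a full pipeline and the
# learned embeddings are stored in textual-inversion-style files).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "xiaofanghf/AGenDA-Finetune-Tokens-Stage2", torch_dtype=torch.float16
).to("cuda")

# If the learned [V1]/[V2]/[V3] embeddings ship as separate files, they can be
# attached with load_textual_inversion (weight_name depends on the repo layout):
# pipe.load_textual_inversion("xiaofanghf/AGenDA-Finetune-Tokens-Stage2", token="[V1]")

# Target-domain prompt following the template above ("cars" and "[T]" are placeholders).
prompt = "an aerial view image with [V1] cars in [V3] [T]"
image = pipe(prompt).images[0]
image.save("aerial_sample.png")
```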
References
➡️ Paper: Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
➡️ Project Page: Webpage
➡️ Code: AGenDA
Base model: CompVis/stable-diffusion-v1-4