Introduction

We introduce two learnable tokens in the prompt to generate complementary cross-attention maps. These tokens are designed to capture the object concept and the background concept, where the background covers all regions outside the objects. We design a two-stage pipeline. Throughout both stages, we set the prompt to "an aerial view image with [V1] [category] in [V2] [S]" for source-domain data and "an aerial view image with [V1] [category] in [V3] [T]" for target-domain data. [V1] is the learnable token for the object concept, while [V2] and [V3] are the learnable tokens for the source-domain and target-domain background concepts, respectively. In the first stage, we learn the tokens [V1], [V2], and [V3]; in the second stage, we fix these learned tokens and further fine-tune the U-Net.
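
For illustration, the sketch below assembles the two prompt templates described above. The [category], [S], and [T] values are left as placeholders; the actual words used during training are not specified here.

```python
# Illustrative prompt construction following the templates above.
# [V1], [V2], [V3] are the learnable tokens; [category], [S], [T] stand for the
# object class and the source/target domain words and remain placeholders here.
def build_prompt(object_token: str, bg_token: str, category: str, domain_word: str) -> str:
    return f"an aerial view image with {object_token} {category} in {bg_token} {domain_word}"

source_prompt = build_prompt("[V1]", "[V2]", "[category]", "[S]")
target_prompt = build_prompt("[V1]", "[V3]", "[category]", "[T]")
```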

Model Usage

This repository contains the Stable Diffusion model fine-tuned in the second stage, together with the learned token embeddings.
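
A minimal usage sketch with diffusers is shown below, assuming the repository follows the standard diffusers pipeline layout and the learned embeddings are stored as textual-inversion weights. The weight file names, placeholder tokens, category, and domain word are hypothetical and only meant to illustrate the loading flow.

```python
# Minimal sketch: load the second-stage fine-tuned Stable Diffusion and the
# learned token embeddings (file names and token strings below are assumptions).
import torch
from diffusers import StableDiffusionPipeline

repo_id = "xiaofanghf/AGenDA-Finetune-Tokens-Stage2"

# Load the pipeline containing the U-Net fine-tuned in the second stage.
pipe = StableDiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Load the learned embeddings as textual-inversion tokens
# (weight_name and token values are hypothetical).
pipe.load_textual_inversion(repo_id, weight_name="object_token.bin", token="<V1>")
pipe.load_textual_inversion(repo_id, weight_name="target_bg_token.bin", token="<V3>")

# Build a target-domain prompt following the template from the Introduction;
# replace the category and domain word with the values used during training.
category = "car"          # hypothetical category
target_word = "[T]"       # replace with the target-domain word used in training
prompt = f"an aerial view image with <V1> {category} in <V3> {target_word}"

image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("aerial_sample.png")
```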

References

➡️ Paper: Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
➡️ Project Page: Webpage
➡️ Code: AGenDA
