Introduction

We introduce two learnable tokens in the prompt to generate complementary cross-attention maps. These tokens are designed to capture the object concept and the background concept, where the background covers all regions outside the objects. We design a two-stage pipeline. Throughout both stages, we set the prompt to "an aerial view image with [V1] [category] in [V2] [S]" for source-domain data and "an aerial view image with [V1] [category] in [V3] [T]" for target-domain data. [V1] is the learnable token for the object concept, while [V2] and [V3] are the learnable tokens for the source-domain and target-domain background concepts, respectively. In the first stage, we learn the tokens [V1], [V2], and [V3]; in the second stage, we fix these learned tokens and further fine-tune the U-Net.
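
For illustration, the sketch below assembles the two prompt templates described above. The [category], [S], and [T] values are left as placeholders; the actual words used during training are not specified here.

```python
# Illustrative prompt construction following the templates above.
# [V1], [V2], [V3] are the learnable tokens; [category], [S], [T] stand for the
# object class and the source/target domain words and remain placeholders here.
def build_prompt(object_token: str, bg_token: str, category: str, domain_word: str) -> str:
    return f"an aerial view image with {object_token} {category} in {bg_token} {domain_word}"

source_prompt = build_prompt("[V1]", "[V2]", "[category]", "[S]")
target_prompt = build_prompt("[V1]", "[V3]", "[category]", "[T]")
```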

Model Usage

This repository contains the Stable Diffusion model fine-tuned in the second stage, together with the learned token embeddings.
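
A minimal usage sketch with diffusers is shown below, assuming the repository follows the standard diffusers pipeline layout and the learned embeddings are stored as textual-inversion weights. The weight file names, placeholder tokens, category, and domain word are hypothetical and only meant to illustrate the loading flow.

```python
# Minimal sketch: load the second-stage fine-tuned Stable Diffusion and the
# learned token embeddings (file names and token strings below are assumptions).
import torch
from diffusers import StableDiffusionPipeline

repo_id = "xiaofanghf/AGenDA-Finetune-Tokens-Stage2"

# Load the pipeline containing the U-Net fine-tuned in the second stage.
pipe = StableDiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Load the learned embeddings as textual-inversion tokens
# (weight_name and token values are hypothetical).
pipe.load_textual_inversion(repo_id, weight_name="object_token.bin", token="<V1>")
pipe.load_textual_inversion(repo_id, weight_name="target_bg_token.bin", token="<V3>")

# Build a target-domain prompt following the template from the Introduction;
# replace the category and domain word with the values used during training.
category = "car"          # hypothetical category
target_word = "[T]"       # replace with the target-domain word used in training
prompt = f"an aerial view image with <V1> {category} in <V3> {target_word}"

image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("aerial_sample.png")
```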

References

➡️ Paper: Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
➡️ Project Page: Webpage
➡️ Code: AGenDA
