---
license: cc-by-nc-4.0
language:
  - en
base_model:
  - CompVis/stable-diffusion-v1-4
pipeline_tag: text-to-image
library_name: diffusers
---

## Introduction

We introduce two learnable tokens into the prompt to generate complementary cross-attention maps. These tokens are designed to capture the object concept and the background concept, where the background covers all regions outside the objects. We design a two-stage pipeline. Throughout both stages, we set the prompt to "an aerial view image with [V1] [category] in [V2] [S]" for source-domain data and "an aerial view image with [V1] [category] in [V3] [T]" for target-domain data. [V1] is the learnable token for the object concept, while [V2] and [V3] are the learnable tokens for the source-domain and target-domain background concepts, respectively. In the second stage, we fix the tokens [V1], [V2], and [V3] learned in the first stage and further fine-tune the U-Net.
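
Below is a minimal sketch of how the two prompt templates above can be assembled. The placeholder strings for [V1]/[V2]/[V3], the example category, and the source/target domain descriptors ([S]/[T]) are assumptions chosen only for illustration; the actual tokens and names used during training are not specified here.

```python
# Illustrative sketch of the prompt templates described above.
# The token placeholders, category, and domain descriptors are assumptions.
SOURCE_TEMPLATE = "an aerial view image with {v1} {category} in {v2} {source_domain}"
TARGET_TEMPLATE = "an aerial view image with {v1} {category} in {v3} {target_domain}"


def build_prompts(category, source_domain, target_domain,
                  v1="<v1>", v2="<v2>", v3="<v3>"):
    """Return (source_prompt, target_prompt) for one object category."""
    source_prompt = SOURCE_TEMPLATE.format(
        v1=v1, category=category, v2=v2, source_domain=source_domain)
    target_prompt = TARGET_TEMPLATE.format(
        v1=v1, category=category, v3=v3, target_domain=target_domain)
    return source_prompt, target_prompt


print(build_prompts("car", "daytime city", "nighttime city"))
```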

## Model Usage

This repository contains the fine-tuned Stable Diffusion weights and the learned token embeddings from the second stage of the pipeline.
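
A minimal loading sketch with 🤗 diffusers follows, assuming the fine-tuned pipeline is stored in this repository in diffusers format and the learned tokens are saved as textual-inversion embedding files. The repository id, embedding file names, token strings, and prompt below are assumptions for illustration; adjust them to the actual contents of this folder.

```python
# Sketch: load the fine-tuned pipeline and the learned token embeddings,
# then sample a target-domain image with the prompt template from the
# Introduction. Repo id, file names, and token strings are assumptions.
import torch
from diffusers import StableDiffusionPipeline

repo_id = "xiaofanghf/<this-repo>"  # hypothetical repository id

pipe = StableDiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Load the learned object/background tokens (file names are assumptions).
pipe.load_textual_inversion(repo_id, weight_name="v1.bin", token="<v1>")
pipe.load_textual_inversion(repo_id, weight_name="v2.bin", token="<v2>")
pipe.load_textual_inversion(repo_id, weight_name="v3.bin", token="<v3>")

# Target-domain prompt: "<v1>" marks the object token, "<v3>" the
# target-domain background token; category and domain names are examples.
prompt = "an aerial view image with <v1> car in <v3> nighttime city"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("sample.png")
```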

## References

➡️ Paper: Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
➡️ Project Page: Webpage
➡️ Code: AGenDA