Diffusion models for garment-centric human generation from text or image prompts have attracted growing attention for their application potential. However, existing methods face a dilemma: lightweight approaches, such as adapters, are prone to generating inconsistent textures, while fine-tuning-based methods incur high training costs and struggle to preserve the generalization capabilities of pretrained diffusion models, which limits their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for garment-centric human generation.
DreamFit has three key advantages:
To further enhance generation quality, DreamFit leverages pretrained large multimodal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both 768 × 512 high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capability in garment-centric human generation.
## Overview

Our method constructs an Anything-Dressing Encoder utilizing LoRA layers. The reference image features are extracted by the Anything-Dressing Encoder and then passed into the denoising UNet via adaptive attention.
Furthermore, we incorporate large multimodal models (LMMs) into the inference process to reduce the text prompt gap between training and testing.
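
To make the "encoder built from LoRA layers" idea concrete, here is a minimal sketch of a generic LoRA linear layer in plain Python. It is not DreamFit's implementation: the class name `LoRALinear`, the rank, the scaling, and the initialization are all assumptions chosen for illustration, and the adaptive attention that feeds reference features into the UNet is not modeled here. The sketch only shows why such an encoder is lightweight: the pretrained weight stays frozen and only two small matrices are trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Generic LoRA layer (illustrative only): y = x W + (alpha / r) * x A B.

    W is the frozen pretrained weight; A and B are the trainable low-rank factors.
    """

    def __init__(self, d_in, d_out, rank=4, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # Stand-in for a pretrained weight; it is never updated.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
        # Trainable down-projection, small random init.
        self.A = [[rng.gauss(0, 0.02) for _ in range(rank)] for _ in range(d_in)]
        # Trainable up-projection, zero init so the layer starts as the base layer.
        self.B = [[0.0] * d_out for _ in range(rank)]
        self.scale = alpha / rank

    def __call__(self, x, use_lora=True):
        base = matmul(x, self.W)
        if not use_lora:
            return base
        delta = matmul(matmul(x, self.A), self.B)
        return [[b + self.scale * d for b, d in zip(b_row, d_row)]
                for b_row, d_row in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=8, d_out=8, rank=2, alpha=2.0)
print(layer.trainable_params(), "trainable vs", layer.frozen_params(), "frozen")
```

Because `B` starts at zero, the layer initially reproduces the frozen base layer exactly, which is the usual way a LoRA branch is attached without disturbing the pretrained model at the start of training.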
## Installation Guide

1. Clone our repo:

```bash
git clone https://github.com/bytedance/DreamFit.git
```

2. Create a new virtual environment:

```bash
conda create -n dreamfit python==3.10
conda activate dreamfit
```

3. Install the dependencies:

```bash
pip install -r requirements.txt
pip install flash-attn --no-build-isolation --use-pep517
```

## Models

1. Download the pretrained models from [here](https://huggingface.co/bytedance-research/Dreamfit) and place the checkpoints in the `pretrained_models` folder.
2. To run inference with the StableDiffusion1.5 version, download [stable-diffusion-v1-5](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5) and [sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) to `pretrained_models`. To generate images in different styles, download the corresponding stylized model, such as [RealisticVision](https://huggingface.co/SG161222/Realistic_Vision_V6.0_B1_noVAE), to `pretrained_models`.
3. To run inference with the Flux version, download [flux-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) to the `pretrained_models` folder.
4. To run inference with pose control, download the [Annotators](https://huggingface.co/lllyasviel/Annotators) to the `pretrained_models` folder.

The folder structure should look like this:

```
├── pretrained_models/
│   ├── flux_i2i_with_pose.bin
│   ├── flux_i2i.bin
│   ├── flux_tryon.bin
│   ├── sd15_i2i.ckpt
│   ├── stable-diffusion-v1-5/
│   │   ├── ...
│   ├── sd-vae-ft-mse/
│   │   ├── diffusion_pytorch_model.bin
│   │   ├── ...
│   ├── Realistic_Vision_V6.0_B1_noVAE(or other stylized model)/
│   │   ├── unet/
│   │   │   ├── diffusion_pytorch_model.bin
│   │   │   ├── ...
│   │   ├── ...
│   ├── Annotators/
│   │   ├── body_pose_model.pth
│   │   ├── facenet.pth
│   │   ├── hand_pose_model.pth
│   ├── FLUX.1-dev/
│   │   ├── flux1-dev.safetensors
│   │   ├── ae.safetensors
│   │   ├── tokenizer
│   │   ├── tokenizer_2
│   │   ├── text_encoder
│   │   ├── text_encoder_2
│   │   ├── ...
```

## Inference

### Garment-Centric Generation

```bash
# inference with FLUX version
bash run_inference_dreamfit_flux_i2i.sh \
    --cloth_path example/cloth/cloth_1.png \
    --image_text "A woman wearing a white Bape T-shirt with a colorful ape graphic and bold text." \
    --save_dir "." \
    --seed 164143088151

# inference with StableDiffusion1.5 version
bash run_inference_dreamfit_sd15_i2i.sh \
    --cloth_path example/cloth/cloth_3.jpg \
    --image_text "A woman with curly hair wears a pink t-shirt with a logo and white stripes on the sleeves, paired with white trousers, against a plain white background." \
    --ref_scale 1.0 \
    --base_model pretrained_models/Realistic_Vision_V6.0_B1_noVAE/unet/diffusion_pytorch_model.bin \
    --base_model_load_method diffusers \
    --save_dir "." \
    --seed 28
```

Tips:

1. If you have multiple pieces of clothing, you can splice them into a single image, as shown in the second row of the table below.
2. Use `--help` to check the meaning of each argument.

| Image Text | Cloth | Output |
|---|---|---|
| A woman wearing a white Bape T-shirt with a colorful ape graphic and bold text. | ![]() | ![]() |
| A young woman with a casual yet stylish look, wearing a blue top, black skirt, and comfortable cream slip-on shoes. | ![]() | ![]() |
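
Before running the inference commands, it can help to confirm that `pretrained_models` matches the folder listing in the Models section. The following is a small sketch of such a check. The relative paths are copied from that listing, but the function name `missing_paths` and the grouping of files into `flux`, `sd15`, and `pose` variants are assumptions made for this example, not part of the repository.

```python
from pathlib import Path

# Relative paths copied from the folder listing in this README.
# Grouping them by variant is an assumption made for this sketch.
EXPECTED = {
    "flux": [
        "flux_i2i.bin",
        "FLUX.1-dev/flux1-dev.safetensors",
        "FLUX.1-dev/ae.safetensors",
    ],
    "sd15": [
        "sd15_i2i.ckpt",
        "stable-diffusion-v1-5",
        "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    ],
    "pose": [
        "flux_i2i_with_pose.bin",
        "Annotators/body_pose_model.pth",
        "Annotators/facenet.pth",
        "Annotators/hand_pose_model.pth",
    ],
}


def missing_paths(variant, root="pretrained_models"):
    """Return the expected paths for `variant` that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED[variant] if not (base / rel).exists()]


if __name__ == "__main__":
    for name in EXPECTED:
        missing = missing_paths(name)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

Run it from the repository root; any variant that reports missing paths needs the corresponding download from the Models section.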

| Image Text | Cloth | Pose Image | Output |
|---|---|---|---|
| A woman wearing a white Bape T-shirt with a colorful ape graphic and bold text. | ![]() | ![]() | ![]() |

| Image Text | Cloth | Keep Image | Output |
|---|---|---|---|
| A woman wearing a white Bape T-shirt with a colorful ape graphic and bold text and a blue jeans. | ![]() | ![]() | ![]() |
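
To compare several seeds for the same garment, the FLUX command from the Garment-Centric Generation section can be generated programmatically. The sketch below only builds the command lines from the flags shown in that section (`--cloth_path`, `--image_text`, `--save_dir`, `--seed`). The helper `build_flux_commands` and the per-seed output directory naming are hypothetical choices for this example, made because this README does not state how output files are named.

```python
import shlex


def build_flux_commands(cloth_path, image_text, seeds, save_root="outputs"):
    """Build one argument list per seed for run_inference_dreamfit_flux_i2i.sh.

    Each seed gets its own save directory (an assumption of this sketch)
    so that results from different seeds cannot overwrite each other.
    """
    commands = []
    for seed in seeds:
        commands.append([
            "bash", "run_inference_dreamfit_flux_i2i.sh",
            "--cloth_path", cloth_path,
            "--image_text", image_text,
            "--save_dir", f"{save_root}/seed_{seed}",
            "--seed", str(seed),
        ])
    return commands


if __name__ == "__main__":
    prompt = ("A woman wearing a white Bape T-shirt with a colorful "
              "ape graphic and bold text.")
    for argv in build_flux_commands("example/cloth/cloth_1.png", prompt, [1, 2, 3]):
        # Print a shell-safe command line; to execute it instead, pass argv
        # to subprocess.run(argv, check=True) from the repository root.
        print(shlex.join(argv))
```

Passing the arguments as a list keeps the prompt intact even when it contains spaces or quotes, which is why the commands are built as lists and only joined for display.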