Qwen/Qwen-Image-Edit · HW requirements

Aug 18

Hello, ty again for the best models for the open source community. I have a question, what are the minimum VRAM requirements to run this model?

TheBigBlockPC

Aug 18

it runs on 24GB in nf4 with a image reselution of 1024x1024 but the visuals will be degraded a bit

perk11

Aug 18

How did you get it to run in nf4?

TheBigBlockPC

Aug 18

import os
from PIL import Image
import torch
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers import QwenImageEditPipeline
model_cache = "/media/kurama/1TB NVME SSD/huggingface_models/"

model_name = "Qwen/Qwen-Image"

# Load the pipeline
if torch.cuda.is_available():
    torch_dtype = torch.bfloat16
    device = "cuda:0"
else:
    torch_dtype = torch.float32
    device = "cpu"
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
        "bnb_4bit_use_double_quant": True
    },
    components_to_quantize=["transformer", "text_encoder"],  # names depend on pipeline
)
pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit",
    torch_dtype=torch_dtype,
    cache_dir=model_cache,
    quantization_config=pipeline_quant_config
)
pipeline.load_lora_weights(
    "LORA/Qwen-Image-Lightning-4steps-V1.0.safetensors"
)
print("pipeline loaded")
pipeline.to("cuda")
image = Image.open("./example.png").convert("RGB")
prompt = "render it in a topdown perspective"
inputs = {
    "image": image,
    "prompt": prompt,
    "generator": torch.manual_seed(1),
    "true_cfg_scale": 2,
    "negative_prompt": " ",
    "num_inference_steps": 4,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save("output_image_edit.png")
    print("image saved at", os.path.abspath("output_image_edit.png"))

you can download the lora from here: https://huggingface.co/lightx2v/Qwen-Image-Lightning
without it it won't work because the model doesn't quantize that well

here is a image generates with Qwen-Image to show how it looks in nf4

C0nsumption

Aug 19

So this works but there is most definitely some artifacts.
In the Qwen-Image repo, @OzzyGT uses TorchAO. Could this be a solution to reduce the artifacts? 🤕🤔

OzzyGT

Aug 19

•

edited Aug 19

both are correct, you can use torchao but if you want to use nf4 you should follow the same method I was using, you have to skip the transformer_blocks.0.img_mod layer or you will get degradation. It works with or without the lighting lora. It uses a little bit more than 17GB of VRAM with bitsandbytes and in 36s with a 3090 using the 8-steps lora.

prompt = "change the dog plushie for a cat preserving the background, the lighting, colors, shadows, also the cat plushie should have the same style of the dog plushie with the same eyes and lines."

source	lighting 8-steps	50steps

code:

import torch
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import Qwen2_5_VLForConditionalGeneration

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import QwenImageEditPipeline, QwenImageTransformer2DModel
from diffusers.utils import load_image


model_id = "Qwen/Qwen-Image-Edit"
torch_dtype = torch.bfloat16
device = "cuda"

quantization_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["transformer_blocks.0.img_mod"],
)
transformer = QwenImageTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
transformer = transformer.to("cpu")

quantization_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="text_encoder",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
text_encoder = text_encoder.to("cpu")

pipe = QwenImageEditPipeline.from_pretrained(
    model_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)

# optionally load LoRA weights to speed up inference
pipe.load_lora_weights("lightx2v/Qwen-Image-Lightning", weight_name="Qwen-Image-Lightning-8steps-V1.1.safetensors")
# pipe.load_lora_weights(
#     "lightx2v/Qwen-Image-Lightning", weight_name="Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors"
# )
pipe.enable_model_cpu_offload()

generator = torch.Generator(device="cuda").manual_seed(42)
image = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/resources/dog_plushie.png"
).convert("RGB")

prompt = "change the dog plushie for a cat preserving the background, the lighting, colors, shadows, also the cat plushie should have the same style of the dog plushie with the same eyes and lines."

# change steps to 8 or 4 if you used the lighting loras
image = pipe(image, prompt, num_inference_steps=8).images[0]

image.save("qwenimageedit.png")

C0nsumption

Aug 19

Been at it for hours. OMG, hype. Going to test out now!! First will test the code you just dropped though.
Thank you both for all your hard work 🤕🫱🏽‍🫲🏻

C0nsumption

Aug 19

•

edited Aug 19

Here is the TorchAO variant.
Works and takes about 22GB-23GB of VRAM.
HUGE SHOUT OUT TO OzzyGT (didn't @ cause I did earlier and don't want to be annoying.)

🤕🫱🏽‍🫲🏻 much love to everyone.

import torch
from PIL import Image
from diffusers import AutoModel, DiffusionPipeline, TorchAoConfig

model_cache = "/path/to/weights/Qwen-image-edit"
model_id = "/path/to/weights/Qwen-Image-Edit"
torch_dtype = torch.bfloat16
device = "cuda"

# TorchAO int8 weight-only on transformer
quantization_config = TorchAoConfig("int8wo")

transformer = AutoModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
pipe = DiffusionPipeline.from_pretrained(
    model_id, 
    transformer=transformer, 
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()


# optional LoRA (works with or without)
pipe.load_lora_weights("/path/to/weights/Qwen-Lora/Qwen-Image-Lightning-8steps-V1.1.safetensors")


prompt = "change the pickle in here hand to an eggplant while preserving the background, the lighting, colors, shadows, also the eggplant should have the same style as the cucumber"


generator = torch.Generator(device="cuda").manual_seed(42)
image = Image.open("./input.jpeg").convert("RGB")



# use 8 (or 4) steps if you're using the Lightning LoRA
image = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=8,
    generator=generator,
).images[0]

image.save("qwenimageedit_torchao.png")

NielsGx

Aug 19

I would wait for a FP8 scaled from Kijai.
Way better than a naive truncated FP8

C0nsumption

Aug 19

•

edited Aug 19

? 🤔.... it's like the literal precursor to quantizing it yourself.

God forbid... doing something on your own 🫱🏽‍🫲🏻
🏇🏾acc/acc

XuehangCang

Aug 20

H100 57.59G

vladciocan88

Aug 20

•

edited Aug 20

both are correct, you can use torchao but if you want to use nf4 you should follow the same method I was using, you have to skip the transformer_blocks.0.img_mod layer or you will get degradation. It works with or without the lighting lora. It uses a little bit more than 17GB of VRAM with bitsandbytes and in 36s with a 3090 using the 8-steps lora.

prompt = "change the dog plushie for a cat preserving the background, the lighting, colors, shadows, also the cat plushie should have the same style of the dog plushie with the same eyes and lines."

source	lighting 8-steps	50steps

code:

import torch
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import Qwen2_5_VLForConditionalGeneration

from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers import QwenImageEditPipeline, QwenImageTransformer2DModel
from diffusers.utils import load_image


model_id = "Qwen/Qwen-Image-Edit"
torch_dtype = torch.bfloat16
device = "cuda"

quantization_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["transformer_blocks.0.img_mod"],
)
transformer = QwenImageTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
transformer = transformer.to("cpu")

quantization_config = TransformersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    subfolder="text_encoder",
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
text_encoder = text_encoder.to("cpu")

pipe = QwenImageEditPipeline.from_pretrained(
    model_id, transformer=transformer, text_encoder=text_encoder, torch_dtype=torch_dtype
)

# optionally load LoRA weights to speed up inference
pipe.load_lora_weights("lightx2v/Qwen-Image-Lightning", weight_name="Qwen-Image-Lightning-8steps-V1.1.safetensors")
# pipe.load_lora_weights(
#     "lightx2v/Qwen-Image-Lightning", weight_name="Qwen-Image-Lightning-4steps-V1.0-bf16.safetensors"
# )
pipe.enable_model_cpu_offload()

generator = torch.Generator(device="cuda").manual_seed(42)
image = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/resources/dog_plushie.png"
).convert("RGB")

prompt = "change the dog plushie for a cat preserving the background, the lighting, colors, shadows, also the cat plushie should have the same style of the dog plushie with the same eyes and lines."

# change steps to 8 or 4 if you used the lighting loras
image = pipe(image, prompt, num_inference_steps=8).images[0]

image.save("qwenimageedit.png")

23s on 4090 Suprim X @18.8Gb

Yuvrajxms09

Aug 21

Heyy @vladciocan88 , thanks for the code and attached results. Btw I am pretty new to this quantization thing, so I was wondering how do you know what to quantize and what to skip in a way that doesn't affect the quality much? and can you pls suggest me some resources or something else which can help me learn quantization/pruning better.

C0nsumption

Aug 24

Heyy @vladciocan88 , thanks for the code and attached results. Btw I am pretty new to this quantization thing, so I was wondering how do you know what to quantize and what to skip in a way that doesn't affect the quality much? and can you pls suggest me some resources or something else which can help me learn quantization/pruning better.

Found these:

Repo that supposedly supports the conversion of:
["flux", "sd3", "ltxv", "hyvid", "wan", "hidream", "qwen"]
https://github.com/ngxson/diffusion-to-gguf/tree/main
This Japanese article that seems to run you through converting Wan2.2 to gguf
https://note.com/198619891990/n/n1ff8cf931d7a
The cpp library has a conversion script/tool
https://github.com/leejet/stable-diffusion.cpp
And the conversion scripts in here (the japanese article mentions them.)
https://github.com/ggml-org/llama.cpp

Noticing the python gguf library is used in both 1 and 2.
Too tired to poke around in the two other repos.
Will investigate more soon, can't sleep cause quantization and gguf are just gnawing at the back of my mind.
Hope this all helps, anything knowledge you can share would be much appreciated.

Blessings 🫶🏽
acc/acc

C0nsumption

Aug 24

Some more info:
All this gguf stuff seems to be built on ggml.

"GGUF is a file format for storing models for inference with GGML and executors based on GGML"

So you still train via pytorch or whatever but then convert to gguf for ggml inference for production....?
https://github.com/ggml-org/ggml

C0nsumption

Aug 24

Reading this script can probably give some better insight I think:
https://github.com/ngxson/diffusion-to-gguf/blob/main/custom_quants.py

I'm currently attempting it on Qwen-Image.
Will circle back.

C0nsumption

Aug 24

Anyone following along:

It works.
This is my own quantized variant of Qwen-Image at Q4_K_S (4 Bit).

Inference via diffusers. There is a ridiculous speed boost with loading the model, even with cpu offload enabled.

If you need help, lmk.
Will make notebooks.
Blessings 🫱🏽‍🫲🏻

OzzyGT

Aug 25

thanks a lot @Consumption A while ago I did some GGUF models for the lighting lora fused models here. I'm planning to do more so they're easily accessible. I need to look more into the edit model but at first it didn't work for me, have to dig more.

C0nsumption

Aug 25

No worries @OzzyGT . Qwen-image-Edit worked but at at Q4_K_S, they are some artifacts with certain kinds of prompts. It still maintains the ability to edit images and prompt adherence but it seems to struggle with realism (artifacts) but the quality for non realistic task holds sound, say turning an image into a different style like anime.

I'll share some results here later.

Yuvrajxms09

Aug 27

heyy guys, thank you so much for sharing the resources.

rockapaper

Sep 5

Anyone following along:

It works.
This is my own quantized variant of Qwen-Image at Q4_K_S (4 Bit).

Inference via diffusers. There is a ridiculous speed boost with loading the model, even with cpu offload enabled.

If you need help, lmk.
Will make notebooks.
Blessings 🫱🏽‍🫲🏻

can you please share your resources along with inference time

fcyrizz

Sep 10

I'm planning to run in the T4 GPU of colab, plz help me with the best settings and method, which quantization will be best

BingoBird

Sep 23

Can we get bigger quants on 2x 24GB 3090?

TheBigBlockPC

Sep 23

Can we get bigger quants on 2x 24GB 3090?

Yes if you quantize it it ruba ob a single 3090 in int4 but if you know how to run diffusers on multi GPU i would appreciate it

C0nsumption

Sep 26

@rockapaper forgive my delay, I got really sick and went on a break (studied graphics cause ML too hard when sick). I'm back now though.

https://github.com/ngxson/diffusion-to-gguf

That's the repo i used.
It was pretty straight forward but now that I know someone needs it, I'll make a notebook and tag you.

C0nsumption

Sep 26

I'm planning to run in the T4 GPU of colab, plz help me with the best settings and method, which quantization will be best

@fcyrizzI think that is 16 GB of VRAM 🤔
I would say bits and bytes and Q4

TheBigBlockPC

Sep 26

I'm planning to run in the T4 GPU of colab, plz help me with the best settings and method, which quantization will be best

@fcyrizzI think that is 16 GB of VRAM 🤔
I would say bits and bytes and Q4

16 GB of vram isn't enough quantize it to q4 and use CPU offload or use pruned variants that are 12b.
https://huggingface.co/OPPOer/Qwen-Image-Pruning

C0nsumption

Sep 26

I'm planning to run in the T4 GPU of colab, plz help me with the best settings and method, which quantization will be best

@fcyrizzI think that is 16 GB of VRAM 🤔
I would say bits and bytes and Q4

16 GB of vram isn't enough quantize it to q4 and use CPU offload or use pruned variants that are 12b.
https://huggingface.co/OPPOer/Qwen-Image-Pruning

Quantizing it isn't bounded by GPU, it's bounded by CPU and RAM.
The actual running of it is bounded by VRAM.