Hardware and VRAM requirements to run this model?

#18

by IronmanSnap - opened Feb 10

IronmanSnap

Feb 10

Hi all, I have ordered an NVIDIA RTX 4090 GPU with 24 GB of VRAM. Is that enough to run this model?

Corny335

Feb 11

Yes

atanasmatev

Feb 18

This is what we use as well. Runs without any issues (with flash_attention_2)
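
For a rough sense of why 24 GB can be enough for inference, a back-of-the-envelope estimate of the weight footprint may help. The sketch below assumes about 8.3 billion parameters in total (language model plus vision encoder); that figure is an assumption to verify against the model card. It counts weights only, so activations, the KV cache, and framework overhead come on top.

```python
# Rough weight-only VRAM estimate.
# ASSUMPTION: ~8.3e9 parameters in total; check the model card for the real figure.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_gib(8.3e9, dtype):5.1f} GiB")
```

Under that assumption, bfloat16 weights come to roughly 15.5 GiB, which leaves several GiB on a 24 GB card for everything else, while float32 weights (about 31 GiB) would not fit.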

n3xt1lxs

Mar 6

I tried running this code on an A30 GPU (24 GB), and it reports that there is not enough GPU VRAM. @Corny335

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
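
The comments in the snippet above imply that one visual token corresponds to roughly a 28x28 pixel area, so the default upper bound of 16384 tokens allows very large images and a correspondingly large memory footprint. The sketch below turns a token range into `min_pixels` and `max_pixels` values and estimates the token count for a given image. The helper names are hypothetical, and the estimate ignores the rounding the real processor applies when resizing, so treat it as an approximation.

```python
PIXELS_PER_TOKEN = 28 * 28  # taken from the min_pixels/max_pixels comments above


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    if not 0 < min_tokens <= max_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def approx_visual_tokens(width: int, height: int, min_pixels: int, max_pixels: int) -> int:
    """Approximate token count after the image is clamped to the pixel budget."""
    pixels = min(max(width * height, min_pixels), max_pixels)
    return pixels // PIXELS_PER_TOKEN


if __name__ == "__main__":
    default = pixel_budget(4, 16384)
    capped = pixel_budget(256, 1280)
    print("4000x3000 image, default budget:", approx_visual_tokens(4000, 3000, *default))
    print("4000x3000 image, capped budget: ", approx_visual_tokens(4000, 3000, *capped))
```

With the default range, a 4000x3000 image stays above 15000 visual tokens, while the 256-1280 range caps it at 1280, which is one plausible reason the same script can run out of memory on one input and not another.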

Corny335

Mar 6

I am running it with vLLM:
vllm serve "Qwen/Qwen2.5-VL-7B-Instruct" --max_model_len 8096 --limit-mm-per-prompt "image=5"

n3xt1lxs

Mar 6

edited Mar 6

> I am running it with vLLM:
> vllm serve "Qwen/Qwen2.5-VL-7B-Instruct" --max_model_len 8096 --limit-mm-per-prompt "image=5"

Is that all it takes?

I also tried configuring min_pixels and max_pixels, and it works now.

RAOYAOLAO

Mar 11

> > I am running it with vLLM:
> > vllm serve "Qwen/Qwen2.5-VL-7B-Instruct" --max_model_len 8096 --limit-mm-per-prompt "image=5"
>
> Is that all it takes?
>
> I also tried configuring min_pixels and max_pixels, and it works now.

Could you kindly provide your code? Even with min_pixels and max_pixels configured, I'm still hitting CUDA OOM on an RTX 4090. I'd really appreciate it!
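
For reference, the memory-saving options mentioned in this thread (bfloat16 weights, flash_attention_2, and a capped pixel range) can be combined as sketched below. This is not the configuration any poster confirmed; it only assembles the commented-out options from the snippet earlier in the thread. The helper names and the 256-1280 token range are illustrative assumptions, and `flash_attention_2` requires the flash-attn package to be installed.

```python
MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"


def processor_kwargs(min_tokens: int = 256, max_tokens: int = 1280) -> dict:
    """Pixel limits for AutoProcessor; one visual token is about 28x28 pixels."""
    return {
        "min_pixels": min_tokens * 28 * 28,
        "max_pixels": max_tokens * 28 * 28,
    }


def load_model_and_processor(min_tokens: int = 256, max_tokens: int = 1280):
    """Load the model in bfloat16 with flash_attention_2 and a capped pixel range."""
    # Imported here so processor_kwargs() can be used without the GPU stack installed.
    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # needs the flash-attn package
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(
        MODEL_ID, **processor_kwargs(min_tokens, max_tokens)
    )
    return model, processor
```

If OOM persists, lowering `max_tokens` further (for example to 512) is one thing to try, since the number of visual tokens per image scales with `max_pixels`.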

sheoran95

Apr 21

Hi all! I'm trying to fine-tune this model on an RTX 4090 with 24 GB of VRAM, but I always get a CUDA out-of-memory error. I'm using LoRA training with DeepSpeed to offload the optimizer to the CPU. What am I doing wrong?

yaobaishen

May 18

> > > I am running it with vLLM:
> > > vllm serve "Qwen/Qwen2.5-VL-7B-Instruct" --max_model_len 8096 --limit-mm-per-prompt "image=5"
> >
> > Is that all it takes?
> >
> > I also tried configuring min_pixels and max_pixels, and it works now.
>
> Could you kindly provide your code? Even with min_pixels and max_pixels configured, I'm still hitting CUDA OOM on an RTX 4090. I'd really appreciate it!

Same here. I used the vLLM command line above and got the following error on an RTX 4090:

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
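
This error generally means that, within the share of GPU memory vLLM is allowed to use, nothing was left for KV-cache blocks once the weights and the profiling run were accounted for. The arithmetic below is a rough sketch of that budget, not a reproduction of vLLM's internal accounting. The language-model shape (28 layers, 4 KV heads, head dimension 128) is an assumption to check against the model's config.json, and the 15.5 GiB weight size and 4 GiB overhead are illustrative placeholders rather than measurements.

```python
def kv_bytes_per_token(layers: int = 28, kv_heads: int = 4,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token; the leading 2 covers one key and one value per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_budget_gib(total_gib: float, utilization: float,
                        weights_gib: float, overhead_gib: float) -> float:
    """Memory left for KV-cache blocks inside the allowed share of the GPU."""
    return total_gib * utilization - weights_gib - overhead_gib


def max_cached_tokens(budget_gib: float, bytes_per_token: int) -> int:
    """Number of tokens that fit in the budget (0 if the budget is negative)."""
    return max(0, int(budget_gib * 1024**3 // bytes_per_token))


if __name__ == "__main__":
    per_token = kv_bytes_per_token()
    for utilization in (0.80, 0.90, 0.95):
        budget = kv_cache_budget_gib(24, utilization, weights_gib=15.5, overhead_gib=4.0)
        print(f"utilization={utilization:.2f}: budget={budget:5.2f} GiB, "
              f"tokens={max_cached_tokens(budget, per_token)}")
```

If the budget comes out at or below zero, things to try include raising `--gpu-memory-utilization` (vLLM's default is 0.9), lowering `--max-model-len`, or reducing the image limit. Other processes holding GPU memory, such as a desktop session, also shrink the budget and might explain why two 24 GB cards behave differently.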

Corny335

May 31

edited May 31

I only have a 3090. Strange that the 4090 behaves differently. Are you using flash-attn?
