Provided code snippet not working?
Result:
Loading checkpoint shards: 100%|████████████████████████| 5/5 [00:03<00:00, 1.30it/s]
['']
Hi
I've prepared a quick demo script to help address the issues you're experiencing. Please note that this is a rapid test, not a fully optimized solution. While it demonstrates the core functionality, I recommend reviewing it carefully and adapting it to your specific needs.
Also, remember to install the transformers library directly from GitHub, as the model requires the latest version:
pip install git+https://github.com/huggingface/transformers accelerate
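If you want to confirm that the install picked up a build with Qwen2.5-VL support, an optional quick check is to import the model class before running the script (if this import fails, the installed transformers version is too old):

import transformers
print(transformers.__version__)
from transformers import Qwen2_5_VLForConditionalGeneration  # fails on releases without Qwen2.5-VL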
Sample code:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the pre-trained model for visual-language conditional generation.
# Configure it to use FP16 precision and Flash Attention v2 for efficient computation.
# Automatically map the model to available devices (e.g., GPU if available).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)

# Ensure the default tensor type matches torch.float16 to avoid dtype mismatches.
torch.set_default_dtype(torch.float16)

# Load the corresponding processor to handle tokenization, image, and video inputs.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare the input message structure: an image URL and a user request to describe it.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Process the textual component of the input message:
# apply the chat template, defer tokenization, and add a generation prompt
# to guide the model's output generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Extract and process the visual information (images and videos) from the message.
image_inputs, video_inputs = process_vision_info(messages)

# Create the input tensors required by the model: the processed text, images,
# and videos, with padding for batch processing. Move the tensors to the GPU
# for accelerated inference.
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to("cuda")

# Generate the model's response, limited to a maximum of 128 new tokens.
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Decode the generated IDs into human-readable text, skipping special tokens
# and leaving tokenization spaces untouched for accuracy.
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

# Print the generated description of the image.
print("Generated description:", output_text[0])
Hope that helps.
M
Hi, I ran the above modified code, but I am stuck with the error below. Can you please suggest how to fix it? I am using a T4 GPU on Colab.
Loading checkpoint shards: 100% 5/5 [00:46<00:00, 6.60s/it]
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
TypeError Traceback (most recent call last)
in <cell line: 0>()
55
56 # Inference: Generation of the output
---> 57 generated_ids = model.generate(**inputs, max_new_tokens=128)
58 # generated_ids_trimmed = [
59 # out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
19 frames
/usr/local/lib/python3.11/dist-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py in apply_rotary_pos_emb_flashatt(tensor, freqs)
164 cos = freqs.cos()
165 sin = freqs.sin()
--> 166 output = apply_rotary_emb(tensor_, cos, sin).type_as(tensor)
167 return output
168
TypeError: 'NoneType' object is not callable
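A likely cause of this particular TypeError: apply_rotary_pos_emb_flashatt calls apply_rotary_emb from the flash-attn package, and that symbol ends up as None when flash-attn is not installed or cannot be used; FlashAttention 2 also does not support Turing GPUs such as the Colab T4. (The CPU-offload warning above is a separate issue: with device_map="auto", accelerate is offloading part of the model to CPU, which slows generation but is not what raises the error.) A minimal sketch of a workaround, assuming you stay on the T4, is to load the model without flash_attention_2 and let PyTorch's SDPA attention be used instead:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Sketch: same loading call as in the demo script, but with SDPA attention
# instead of FlashAttention 2, which is not available on T4 (Turing) GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # or "eager"; avoids the flash-attn rotary-embedding path
    device_map="auto"
)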