
Hi from the NVIDIA PM team

#2
by jpenningNVIDIA

Hi HF community! πŸ‘‹

The NVIDIA PM team here. We're thrilled to share Cosmos-Reason1 with this community and can't wait to see what you build with it.

We'd love to hear your thoughts and experiences:

  1. What's working well for your use cases?
  2. What challenges are you running into?
  3. What features would unlock new possibilities for your projects?
  4. Tell us about your projects - we'd love to learn what you're building with Cosmos-Reason1!

Whether you're just getting started or diving deep into advanced implementations, your feedback helps us make this better for everyone. Drop a comment, start a discussion, or tag us in your projects!

Looking forward to the conversations ahead! πŸš€

Hi,

Thanks for initiating this thread.

I'm running into an issue where the model doesn't fit on either a 12GB GPU or Colab, which is quite disappointing. For a 7B model, and considering typical test-time compute requirements, I would expect a quantized version to occupy less than 10GB and be runnable.
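For context, here is a rough back-of-the-envelope weight count (my own numbers, not from the model card): 7B parameters in bf16 is already ~14GB before activations and the KV cache, so only a quantized build has a chance of fitting in 12GB:

# Rough VRAM needed for the weights alone of a 7B model (my own
# back-of-the-envelope estimate; ignores activations, the KV cache,
# and the extra memory of the vision tower).
params = 7e9
print(f"bf16 (2 bytes/param):   {params * 2.0 / 1e9:.1f} GB")  # ~14.0 GB -> OOM on 12 GB
print(f"fp8  (1 byte/param):    {params * 1.0 / 1e9:.1f} GB")  # ~7.0 GB
print(f"nf4  (~0.5 byte/param): {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB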

To be honest, I haven't been able to load the model properly using:

  • FP8 or AWQ quantization (no AWQ config was found). There was no clear indication that FP8 quantization was active: memory usage stayed the same and I still hit an Out Of Memory (OOM) error. A sketch of what I tried is below.
  • cpu_offload_gb, which consistently threw errors.
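For reference, this is roughly the vLLM call I was attempting (the parameter values here are illustrative, not from any official example):

# Roughly the vLLM invocation I tried (illustrative values, not an official recipe).
from vllm import LLM

llm = LLM(
    model="nvidia/Cosmos-Reason1-7B",
    quantization="fp8",   # no visible memory reduction for me -> still OOM
    cpu_offload_gb=4,     # offload part of the weights to CPU RAM; this kept erroring
    max_model_len=4096,   # illustrative: a smaller context shrinks the KV cache
)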

Since this is a Qwen2.5-VL-Instruct-based model, I checked the Qwen repository and adapted their inference code instead of using a vLLM-based approach. For beginners like me, the lack of clear information on how to use FP8 or AWQ directly is a significant barrier and can cause people to lose interest in testing the model. It would be helpful if the model card included these details.

I've attached my adapted code, which now works fine. However, I'm still unsure whether the response quality varies, as I wasn't getting the expected output with sample.mp4 as described in the GitHub README.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from qwen_vl_utils import process_vision_info
import torch

# Model and quantization setup: 4-bit NF4 (bitsandbytes) so the 7B model fits on a 12GB GPU
MODEL_PATH = "nvidia/Cosmos-Reason1-7B"
# MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load model
llm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Message structure
video_messages = [
    {"role": "system", "content": (
        "You are a helpful assistant. Answer the question in the following format:\n"
        "<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
    )},
    {"role": "user", "content": [
        {"type": "text", "text": "Is it safe to turn right?"},
        {"type": "video", "video": "assets/av_example.mp4", "fps": 4},
    ]},
]

# Processor and inputs
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    video_messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs, video_kwargs = process_vision_info(video_messages, return_video_kwargs=True)

inputs = processor(
    text=[prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,  # forward fps so the video temporal positions are computed correctly
)
inputs = inputs.to(llm.device)

# Leftover from the abandoned vLLM attempt (unused in this transformers path):
# mm_data = {}
# if image_inputs is not None:
#     mm_data["image"] = image_inputs
# if video_inputs is not None:
#     mm_data["video"] = video_inputs
# llm_inputs = {"prompt": prompt, "multi_modal_data": mm_data, "mm_processor_kwargs": video_kwargs}

# Generate (do_sample=True is required for temperature/top_p to take effect)
generated_ids = llm.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_new_tokens=4096,
)

# Trim the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
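
Since the system prompt asks for <think>/<answer> tags, a small post-processing step (my own addition, not part of the original Qwen code) can separate the reasoning from the final answer:

# Optional post-processing (my own addition): pull the final answer out of
# the <answer>...</answer> block requested in the system prompt.
import re

match = re.search(r"<answer>\s*(.*?)\s*</answer>", output_text[0], re.DOTALL)
print(match.group(1) if match else output_text[0])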

Occupies around 8.2GB of GPU memory.
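
For anyone reproducing this, the peak figure can be checked after generation with standard PyTorch utilities (my addition, not part of the script above):

# Verify peak GPU memory after generation (standard PyTorch calls).
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
print(f"Peak reserved:  {torch.cuda.max_memory_reserved() / 1e9:.1f} GB")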

Cosmos Response:

["\nOkay, let's see. The user is asking if it's safe to turn right based on the scenario provided. First, I need to recall the details given. The driver is approaching a T-intersection and has come to a complete stop because there's no On-Coming Traffic. The video is a residential area with parked cars on both sides, driveways with vehicles entering and exiting, and crosswalks.\n\nSo, the key points here are that they've stopped completely. The main issue after stopping at a T-intersection without oncoming traffic is checking for any incoming vehicles from the direction they want to turn into. Since they shown there's no oncoming traffic, that suggests the way they're turning right doesn't have cars approaching from the left. But wait, in a T-intersection, when you're at the end of the road, the traffic coming from the main road would be on the left or right. Wait, maybe I should clarify. If you're at a T-intersection, the road you're on ends, and you have to turn either left or right onto the intersecting road. So if they're turning right, the oncoming traffic would be coming from the left on the intersecting road. But since there's no oncoming traffic, it's safe to proceed. However, they must also check for any pedestrians in the crosswalk. The scenario mentions crosswalks, so even though there's no traffic, pedestrians might be present. Also, there are parked cars and vehicles entering/exiting driveways. A vehicle could suddenly pull out as they're turning. So even if there's no oncoming traffic, they need to ensure that no cars are coming from the left or right, and no pedestrians are crossing. Since they've stopped and there's no oncoming traffic shown, but the presence of parked cars and driveways means they have to be cautious. However, the answer might hinge on whether the crosswalk is clear. The video shows crosswalks, so they should check for pedestrians. If there are no pedestrians and the way is clear, then it's safe. The initial assessment was that they can proceed safely after stopping. So the answer is yes, it's safe, but with the caveat of checking all directions. But the user's question is straightforward: is it safe to turn right? Based on the info given, since there's no oncoming traffic and they've stopped, the answer is yes. Unless there's an implicit danger not shown, like a pedestrian stepping out or a car from a driveway. But the scenario states there's no oncoming traffic, so probably safe. So the answer should be yes.\n\n\n\nYes, it is safe to turn right under the shown conditions. The driver has come to a complete stop, noted the absence of oncoming traffic, and the scenario emphasizes safety precautions (checking crosswalks, parked vehicles, and potential driveway movements). Since there are no active conflicts (e.g., pedestrians, vehicles approaching from the left/right), the path appears clear after stopping. However, the driver should still remain vigilant for unexpected hazards (e.g., a vehicle pulling out from a driveway or a pedestrian entering the crosswalk) before proceeding.\n"]

Qwen2.5-VL Instruct Response (I'm aware that it is not a reasoning model):

["\nTo determine if it is safe to turn right, we need to consider several factors:\n\n1. Traffic Conditions: There are no other vehicles immediately visible in the image that are approaching or waiting at the intersection.\n2. Visibility: The road appears clear ahead of the vehicle, and there are no obstructions that would obstruct the driver's view.\n3. Signs and Signals: There are no visible traffic lights or signs indicating that turning right is not allowed.\n\nGiven these observations, it seems safe to proceed with a right turn assuming there are no hidden dangers or restrictions beyond what is visible in the image.\n\n\n\nIt appears safe to turn right based on the current conditions visible in the image. However, it's always advisable to check for any hidden hazards or changes in conditions before making a turn.\n"]

Just some feedback:
NVIDIA has released some of the greatest innovations for the community, but unclear documentation, breaking changes, and lack of backward compatibility are significant issues from a developer's point of view.

Kudos to all the amazing teams across the globe at NVIDIA for their outstanding work. 😄

Hi @Jaykumaran17, thanks for the comprehensive post. I'll get this to engineering to respond ASAP.

Hello,

I'm following up on the query in this thread: Issues #16
