Image-Text-to-Text · Transformers · Safetensors · Cosmos · English · qwen2_5_vl · nvidia · conversational · text-generation-inference
tsungyi committed · Commit 5cace25 (verified) · 1 Parent(s): 6554ea3
Update README.md

Files changed (1):
  1. README.md +67 -2
README.md CHANGED
@@ -173,9 +173,74 @@ We release text annotations for all embodied reasoning datasets and videos for R
 
 
 ## Inference:
-**Acceleration Engine:** PyTorch, flash attention <br>
 **Test Hardware:** H100, A100, GB200 <br>
-* Minimum 2 GPU cards, multi nodes require Infiniband / ROCE connection <br>
+```python
+from transformers import AutoProcessor
+from vllm import LLM, SamplingParams
+from qwen_vl_utils import process_vision_info
+
+# You can also replace MODEL_PATH with the path to a local safetensors folder, as mentioned above.
+MODEL_PATH = "nvidia/Cosmos-Reason1-7B"
+
+# limit_mm_per_prompt caps how many images / videos a single request may carry.
+llm = LLM(
+    model=MODEL_PATH,
+    limit_mm_per_prompt={"image": 10, "video": 10},
+)
+
+sampling_params = SamplingParams(
+    temperature=0.6,
+    top_p=0.95,
+    repetition_penalty=1.05,
+    max_tokens=4096,
+)
+
+video_messages = [
+    {"role": "system", "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."},
+    {"role": "user", "content": [
+        {"type": "text", "text": "Is it safe to turn right?"},
+        {
+            "type": "video",
+            "video": "file:///path/to/your/video.mp4",
+            "fps": 4,
+        },
+    ]},
+]
+
+# Here we use video messages as a demonstration
+messages = video_messages
+
+processor = AutoProcessor.from_pretrained(MODEL_PATH)
+prompt = processor.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
+
+mm_data = {}
+if image_inputs is not None:
+    mm_data["image"] = image_inputs
+if video_inputs is not None:
+    mm_data["video"] = video_inputs
+
+llm_inputs = {
+    "prompt": prompt,
+    "multi_modal_data": mm_data,
+    # FPS will be returned in video_kwargs
+    "mm_processor_kwargs": video_kwargs,
+}
+
+outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
+generated_text = outputs[0].outputs[0].text
+
+print(generated_text)
+```
+
 
 ## Ethical Considerations
 
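
The system prompt in the added example asks the model to wrap its chain of thought in `<think>` tags and its final answer in `<answer>` tags, so `generated_text` has to be split before the answer can be consumed downstream. Below is a minimal sketch of that split, assuming the model follows the requested format; the `split_reasoning` helper and the sample string are illustrative, not part of the model card.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) using the <think>/<answer>
    tags requested by the system prompt above."""
    think = re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    # Fall back to the raw text if the model ignored the format.
    return (
        think.group(1) if think else "",
        answer.group(1) if answer else text.strip(),
    )

# Hypothetical model output, kept small so the snippet runs on its own;
# in practice you would pass the generated_text from the example above.
sample = (
    "<think>\nA cyclist is approaching in the bike lane.\n</think>\n\n"
    "<answer>\nNo, wait for the cyclist to pass before turning.\n</answer>"
)
reasoning, answer = split_reasoning(sample)
print(answer)
```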
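
The added example runs through vLLM; since the repository is tagged `qwen2_5_vl`, the checkpoint should also load through the standard Qwen2.5-VL path in `transformers`. The following is a sketch under that assumption, not verified against this checkpoint, with generation settings simplified relative to the vLLM example.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_PATH = "nvidia/Cosmos-Reason1-7B"

# device_map="auto" shards the weights across the available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Same chat structure as in the vLLM example above.
messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."},
    {"role": "user", "content": [
        {"type": "text", "text": "Is it safe to turn right?"},
        {"type": "video", "video": "file:///path/to/your/video.mp4", "fps": 4},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=4096)
# Drop the prompt tokens so only the completion is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```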