llava-hf/llava-onevision-qwen2-72b-ov-chat-hf

RaushanTurganbay HF staff committed on Jan 8

Commit 32e713a · verified · 1 Parent(s): 70ff051

Update pipeline example

Files changed (1)

README.md +8 -19

README.md CHANGED

@@ -9,7 +9,6 @@ tags:
 datasets:
 - lmms-lab/LLaVA-OneVision-Data
 pipeline_tag: image-text-to-text
-inference: false
 arxiv: 2408.03326
 ---
 # LLaVA-Onevision Model Card
@@ -54,35 +53,25 @@ The model supports multi-image and multi-prompt generation. Meaning that you can
 Below we used [`"llava-hf/llava-onevision-qwen2-72b-ov-chat-hf"`](https://huggingface.co/llava-hf/llava-onevision-qwen2-72b-ov-chat-hf) checkpoint.
 ```python
-from transformers import pipeline, AutoProcessor
-from PIL import Image
-import requests
-model_id = "llava-hf/llava-onevision-qwen2-72b-ov-chat-hf"
-pipe = pipeline("image-to-text", model=model_id)
-processor = AutoProcessor.from_pretrained(model_id)
-url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
-# Define a chat history and use `apply_chat_template` to get correctly formatted prompt
-# Each value in "content" has to be a list of dicts with types ("text", "image")
-conversation = [
     {
       "role": "user",
       "content": [
           {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
-          {"type": "image"},
         ],
     },
 ]
-prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
-print(outputs)
->>> {"generated_text": "user\n\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nassistant\nLava"}
 ```
 ### Using pure `transformers`:
 Below is an example script to run generation in `float16` precision on a GPU device:

 datasets:
 - lmms-lab/LLaVA-OneVision-Data
 pipeline_tag: image-text-to-text
 arxiv: 2408.03326
 ---
 # LLaVA-Onevision Model Card
 Below we used [`"llava-hf/llava-onevision-qwen2-72b-ov-chat-hf"`](https://huggingface.co/llava-hf/llava-onevision-qwen2-72b-ov-chat-hf) checkpoint.
 ```python
+from transformers import pipeline
+pipe = pipeline("image-text-to-text", model="llava-onevision-qwen2-72b-ov-chat-hf")
+messages = [
     {
       "role": "user",
       "content": [
+          {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"},
           {"type": "text", "text": "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"},
         ],
     },
 ]
+out = pipe(text=messages, max_new_tokens=20)
+print(out)
+>>> [{'input_text': [{'role': 'user', 'content': [{'type': 'image', 'url': 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg'}, {'type': 'text', 'text': 'What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud'}]}], 'generated_text': 'Lava'}]
 ```
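Review note: the updated example drops the separate `AutoProcessor`, `PIL`, and `requests` steps and instead passes the image URL inside the chat message itself. A minimal sketch of that pattern follows. The `pipeline("image-text-to-text", ...)` call and the `pipe(text=..., max_new_tokens=...)` invocation are taken from the added lines; the helper names `build_messages` and `ask` and the constants are hypothetical and used only for illustration. The sketch also assumes the full repo id `llava-hf/llava-onevision-qwen2-72b-ov-chat-hf` from the checkpoint link, since the added line passes the model name without the `llava-hf/` prefix, which may not resolve on the Hub.

```python
from typing import Any

# Assumption: full repo id as given in the model card's checkpoint link.
MODEL_ID = "llava-hf/llava-onevision-qwen2-72b-ov-chat-hf"
IMAGE_URL = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/transformers/tasks/ai2d-demo.jpg"
)
QUESTION = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"


def build_messages(image_url: str, question: str) -> list[dict[str, Any]]:
    """Hypothetical helper: build one user turn with the image given by URL.

    Mirrors the message layout in the updated README: the image entry
    carries its own "url" instead of being passed as a separate argument.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


def ask(image_url: str, question: str, max_new_tokens: int = 20):
    """Hypothetical helper: run the pipeline call shown in the updated README."""
    # Imported lazily so build_messages works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model=MODEL_ID)
    return pipe(text=build_messages(image_url, question), max_new_tokens=max_new_tokens)
```

Calling `ask(IMAGE_URL, QUESTION)` would download and load the 72B checkpoint, so it is left out of the sketch; `build_messages` can be checked on its own without any model.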
 ### Using pure `transformers`:
 Below is an example script to run generation in `float16` precision on a GPU device: