+---
+inference: false
+language:
+- th
+- en
+library_name: transformers
+tags:
+- instruct
+- chat
+license: llama3
+---
+# Typhoon Vision Research Preview
+This is the research preview of Typhoon Vision.
+Typhoon Vision is a family of Vision Language Models (VLMs) built specifically for the 🇹🇭 Thai language and Thai culture.
+Here we provide **Llama3 Typhoon Instruct Vision Preview** which is built upon [Llama-3-Typhoon-1.5-8B-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct) and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).
+# Model Description
+- **Model type**: An 8B instruct decoder-only model with a vision encoder, based on the Llama architecture.
+- **Requirement**: transformers 4.38.0 or newer.
+- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
+- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
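The version requirement listed above can be checked before loading the model. The following is a minimal sketch, not part of the original card: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison is deliberately simplified (it looks only at the numeric major, minor, and patch parts and ignores pre-release suffixes).

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.38.0"  # minimum version stated in the model description


def version_tuple(version: str) -> tuple:
    """Turn '4.38.0' or '4.40.0.dev0' into a comparable (major, minor, patch) tuple."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits of each part
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '4.41'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version is at least the minimum (suffixes ignored)."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = metadata.version("transformers")
    status = "OK" if meets_minimum(installed) else f"too old, need >= {MIN_TRANSFORMERS}"
    print(f"transformers {installed}: {status}")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

If the check reports an older version, upgrading with `pip install -U transformers` should resolve it.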
+# Quickstart
+The following snippet shows how to use the model with transformers.
+Before running it, install the following dependencies:
+```shell
+pip install torch transformers accelerate pillow
+```
+If a single GPU has enough memory to hold the model, the snippet runs faster when the process is restricted to that GPU by setting `CUDA_VISIBLE_DEVICES=0`.
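Besides exporting the variable in the shell, it can be set from Python, as long as this happens before CUDA is initialized; doing it before `import torch` is the safe choice. A minimal sketch, where `restrict_to_gpu` is a hypothetical helper name rather than part of any library:

```python
import os


def restrict_to_gpu(index: int = 0) -> str:
    """Expose only one GPU to this process; call before importing torch."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = restrict_to_gpu(0)
print(f"CUDA_VISIBLE_DEVICES={visible}")
```

The equivalent shell form is to prefix the command that runs the script with `CUDA_VISIBLE_DEVICES=0`. The full quickstart snippet follows.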
+```python
+import torch
+import transformers
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from PIL import Image
+import warnings
+import io
+import requests
+# disable some warnings
+transformers.logging.set_verbosity_error()
+transformers.logging.disable_progress_bar()
+warnings.filterwarnings('ignore')
+# set device
+device = 'cuda'  # or cpu
+torch.set_default_device(device)
+# create model
+model = AutoModelForCausalLM.from_pretrained(
+    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
+    torch_dtype=torch.float16, # float32 for cpu
+    device_map='auto',
+    trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained(
+    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
+    trust_remote_code=True)
+def prepare_inputs(text, has_image=False, device='cuda'):
+    messages = [
+        {"role": "system", "content": "You are a helpful vision-capable assistant who eagerly converses with the user in their language."},
+    ]
+    if has_image:
+        messages.append({"role": "user", "content": "<|image|>\n" + text})
+    else:
+        messages.append({"role": "user", "content": text})
+    inputs_formatted = tokenizer.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=False
+    )
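+    text_chunks = [tokenizer(chunk).input_ids for chunk in inputs_formatted.split('<|image|>')]
+    if has_image:
+        # -200 marks the image position; [1:] skips the first token of the second chunk
+        ids = text_chunks[0] + [-200] + text_chunks[1][1:]
+    else:
+        # text-only prompt: there is no '<|image|>' tag, so only one chunk exists
+        ids = text_chunks[0]
+    input_ids = torch.tensor(ids, dtype=torch.long).unsqueeze(0).to(device)
+    attention_mask = torch.ones_like(input_ids).to(device)
+    return input_ids, attention_mask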
+prompt = 'บอกทุกอย่างที่เห็นในรูป'  # Thai for "Describe everything you see in the picture"
+img_url = "https://img.traveltriangle.com/blog/wp-content/uploads/2020/01/cover-for-Thailand-In-May_27th-Jan.jpg"
+image = Image.open(io.BytesIO(requests.get(img_url).content))
+image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
+input_ids, attention_mask = prepare_inputs(prompt, has_image=True, device=device)
+# generate
+output_ids = model.generate(
+    input_ids,
+    images=image_tensor,
+    max_new_tokens=1000,
+    use_cache=True,
+    temperature=0.2,
+    top_p=0.2,
+    repetition_penalty=1.0,  # increase this to reduce repetitive output
+)[0]
+print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
+```
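To make the token handling in `prepare_inputs` easier to follow, here is a small standalone sketch of the same splicing idea: the prompt is split at the `<|image|>` tag, each part is tokenized, and the parts are joined with the `-200` placeholder ID used in the snippet above. The names `splice_image_placeholder` and `toy_encode` are hypothetical, and `toy_encode` is only a stand-in for `tokenizer(text).input_ids` (it prepends a fake BOS token `1` and maps each character to its code point), so no model download is needed.

```python
from typing import Callable, List

IMAGE_PLACEHOLDER_ID = -200  # same placeholder value as in the quickstart snippet


def splice_image_placeholder(
    prompt: str,
    encode: Callable[[str], List[int]],
    image_tag: str = "<|image|>",
) -> List[int]:
    """Tokenize the text around each image tag and join the parts with the placeholder."""
    chunks = [encode(part) for part in prompt.split(image_tag)]
    ids = list(chunks[0])
    for chunk in chunks[1:]:
        ids.append(IMAGE_PLACEHOLDER_ID)
        ids.extend(chunk[1:])  # skip the first token that encode() prepends
    return ids


def toy_encode(text: str) -> List[int]:
    """Hypothetical tokenizer stand-in: a fake BOS token (1) plus character codes."""
    return [1] + [ord(ch) for ch in text]


with_image = splice_image_placeholder("a<|image|>b", toy_encode)
text_only = splice_image_placeholder("ab", toy_encode)
print(with_image)  # [1, 97, -200, 98]
print(text_only)   # [1, 97, 98]
```

Because the loop runs over however many chunks the split produces, a prompt without an image tag simply yields its own token IDs with no placeholder.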