CaskTalk-VLM (CaskTalk Vision Language Model)


Model Details

  • Developed by: ToriLab (CasTalk)
  • Model type: LLaVA-based vision-language model with a Mistral-7B language backbone

Usage

Prerequisites

pip install --upgrade pip
pip install "transformers>=4.39.0"
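If you want to confirm that the installed version meets this requirement, a quick check is:

import transformers

# The inference code below assumes transformers >= 4.39.0 (LLaVA-NeXT support).
print(transformers.__version__)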

Inference

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

processor = LlavaNextProcessor.from_pretrained("torilab/casktalk-vlm-v1.0")

# Load the weights in half precision to reduce memory usage.
model = LlavaNextForConditionalGeneration.from_pretrained(
    "torilab/casktalk-vlm-v1.0",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)
model.to(device)

We now pass the image and the text prompt to the processor, then pass the processed inputs to the model's generate method.

from PIL import Image
import requests

url = "<your_user_image>"  # replace with the URL of your image
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

# Keyword arguments keep the call unambiguous across transformers versions.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=100)

Finally, call the processor's decode method to turn the generated token IDs back into text.

print(processor.decode(output[0], skip_special_tokens=True))
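If GPU memory is tight, the model can also be loaded with 4-bit quantization. This is a minimal sketch, assuming the optional bitsandbytes and accelerate packages are installed and a CUDA GPU is available; it is not required by the setup above.

from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
import torch

# 4-bit NF4 quantization, computing in float16 to match the rest of this card.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "torilab/casktalk-vlm-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

The processor and generation code stay the same; just move the inputs to model.device instead of a manually chosen device.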

About ToriLab

ToriLab builds reliable, practical, and scalable AI solutions for the CasTalk app.
