Gemma 3 for OpenArc has landed!
My project OpenArc, an inference engine for OpenVINO, now supports this model and serves inference over OpenAI-compatible endpoints for text-to-text and text-with-vision! That release comes out today or tomorrow.
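On the endpoint side, requests follow the standard OpenAI chat format. The sketch below is illustrative only: the base URL, port, API key, and served model name are placeholders for whatever your local OpenArc deployment uses, not values taken from OpenArc itself.

from openai import OpenAI

# Hypothetical local OpenArc deployment; swap in your own base_url, api_key, and served model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="gemma-3-4b-it-int8_asym-ov",
    messages=[{"role": "user", "content": "Describe OpenVINO in one sentence."}],
)
print(response.choices[0].message.content)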
We have a growing Discord community of others interested in using Intel for AI/ML.
This model was converted to the OpenVINO IR format using the following Optimum-CLI command:
optimum-cli export openvino -m "input-model" --task image-text-to-text --weight-format int8 "converted-model"
- Find documentation on the Optimum-CLI export process here
- Use my HF space Echo9Zulu/Optimum-CLI-Tool_tool to build commands and run them locally
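For example, a conversion along the lines of this repo might look like the command below; the source repo id and output folder are illustrative, not a record of the exact command used here.

optimum-cli export openvino -m google/gemma-3-4b-it --task image-text-to-text --weight-format int8 gemma-3-4b-it-int8_asym-ov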
What does the test code do?
Well, it demonstrates how to run inference in Python and which parts of that code matter for benchmarking performance. Text-only generation poses different challenges than text generation with images; for example, vision encoders often use different strategies for handling the properties an image can have, such as resolution and aspect ratio. In practice this translates to higher memory usage, reduced throughput, or bad results.
To run the test code:
- Install device-specific drivers (see the quick device check after the install command below)
- Build Optimum-Intel for OpenVINO from source
- Find your spiciest images to get that AGI refusal smell
pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel.git"
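Before loading onto a GPU or NPU, it is worth confirming that OpenVINO can actually see your device; the names returned depend entirely on your hardware and drivers.

import openvino as ov

# Lists the devices OpenVINO can dispatch to, e.g. ['CPU', 'GPU.0', 'NPU'] depending on installed drivers.
print(ov.Core().available_devices)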
import time
from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM
model_id = "Echo9Zulu/gemma-3-4b-it-int8_asym-ov" # Can be an HF id or a path
ov_config = {"PERFORMANCE_HINT": "LATENCY"} # Optimizes for first token latency and locks to single CPU socket
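# Other ov_config options worth knowing: {"PERFORMANCE_HINT": "THROUGHPUT"} trades first-token latency for parallelism,
# and {"CACHE_DIR": "./ov_cache"} persists the compiled model between runs (the directory name here is just an example).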
print("Loading model... this should get faster after the first generation due to caching behavior.")
print("")
start_load_time = time.time()
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="CPU", ov_config=ov_config) # For GPU use "GPU.0"
processor = AutoProcessor.from_pretrained(model_id) # Instead of AutoTokenizer we use AutoProcessor, which routes to the appropriate input processor, i.e. how the model expects image tokens to be prepared.
# Under the hood this takes care of model-specific preprocessing and overlaps in functionality with AutoTokenizer.
end_load_time = time.time()
image_path = r"" # Set this to your test image; this script expects a .png
image = Image.open(image_path)
image = image.convert("RGB") # Required by Gemma 3. In practice this would need to be handled at the engine level OR in model-specific pre-processing.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")
input_token_count = len(inputs.input_ids[0])
print(f"Sum of image and text tokens: {len(inputs.input_ids[0])}")
start_time = time.time()
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)] # Keep only the newly generated tokens by slicing off the prompt
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
num_tokens_generated = len(generated_ids[0])
load_time = end_load_time - start_load_time
generation_time = time.time() - start_time
tokens_per_second = num_tokens_generated / generation_time
average_token_latency = generation_time / num_tokens_generated
print("\nPerformance Report:")
print("-"*50)
print(f"Input Tokens : {input_token_count:>9}")
print(f"Generated Tokens : {num_tokens_generated:>9}")
print(f"Model Load Time : {load_time:>9.2f} sec")
print(f"Generation Time : {generation_time:>9.2f} sec")
print(f"Throughput : {tokens_per_second:>9.2f} t/s")
print(f"Avg Latency/Token : {average_token_latency:>9.3f} sec")
print(output_text[0])
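If you would rather watch tokens arrive than wait for the full report, the standard transformers streamer should work with the same inputs. A minimal sketch, assuming the processor exposes its underlying tokenizer as processor.tokenizer:

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated; the prompt is skipped.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=1024, streamer=streamer)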