
There are several ways to run inference with OpenVINO models for text generation.

With OpenVINO GenAI:

import openvino_genai as ov_genai

model_dir = "path-to-your-converted-model"
pipe = ov_genai.LLMPipeline(
    model_dir,  # Path to the model directory
    "CPU",      # Device to run inference on
)

generation_config = ov_genai.GenerationConfig(
    max_new_tokens=128
)

prompt = "We don't even have a chat template so strap in and let it ride!"

result = pipe.generate([prompt], generation_config=generation_config)
perf_metrics = result.perf_metrics

print(f'Load time: {perf_metrics.get_load_time() / 1000:.2f} s')
print(f'Time to first token: {perf_metrics.get_ttft().mean / 1000:.2f} s')
print(f'Time per token: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
print(f'Generate duration: {perf_metrics.get_generate_duration().mean / 1000:.2f} s')

print(result)  # Print the decoded generation

And with Optimum-Intel, OpenVINO's integration with Hugging Face Transformers:
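Here is a minimal sketch of the same generation with Optimum-Intel (assuming optimum-intel is installed with OpenVINO support; the model path is a placeholder, as above):

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "path-to-your-converted-model"

# Load the OpenVINO model and tokenizer from the same directory
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "We don't even have a chat template so strap in and let it ride!"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 128 new tokens, matching the GenAI example above
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Since Optimum-Intel mirrors the Transformers API, the usual generate() arguments apply here as well.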
