
There are several ways to run inference with OpenVINO models for text generation.

With OpenVINO GenAI:

import openvino_genai as ov_genai

model_dir = "path-to-your-converted-model"
pipe = ov_genai.LLMPipeline(
    model_dir,  # Path to the model directory
    "CPU",      # Device to run inference on
)

generation_config = ov_genai.GenerationConfig(
    max_new_tokens=128
)

prompt = "We don't even have a chat template so strap in and let it ride!"

result = pipe.generate([prompt], generation_config=generation_config)
perf_metrics = result.perf_metrics

print(f'Load time: {perf_metrics.get_load_time() / 1000:.2f} s')
print(f'Time to first token: {perf_metrics.get_ttft().mean / 1000:.2f} s')
print(f'Time per token: {perf_metrics.get_tpot().mean:.2f} ms/token')
print(f'Throughput: {perf_metrics.get_throughput().mean:.2f} tokens/s')
print(f'Generate duration: {perf_metrics.get_generate_duration().mean / 1000:.2f} s')

print(result)  # Print the decoded generation

And with Optimum-Intel, OpenVINO's integration with Hugging Face Transformers:
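Here is a minimal sketch of the same generation with Optimum-Intel (assuming optimum-intel is installed with OpenVINO support; the model path is a placeholder, as above):

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "path-to-your-converted-model"

# Load the OpenVINO model and tokenizer from the same directory
model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")
tokenizer = AutoTokenizer.from_pretrained(model_dir)

prompt = "We don't even have a chat template so strap in and let it ride!"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 128 new tokens, matching the GenAI example above
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Since Optimum-Intel mirrors the Transformers API, the usual generate() arguments apply here as well.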
