This model was converted using the Kokoro notebook from OpenVINO Notebooks, with weight-only compression to FP16.
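
For reference, the core of that conversion comes down to a couple of OpenVINO calls. The sketch below is not the notebook verbatim: the wrapper class, the example shapes, and the disable_complex flag are my assumptions, so consult the notebook for the exact recipe.

import torch
import openvino as ov
from kokoro.model import KModel

class ExportableKModel(torch.nn.Module):
    """Exposes KModel.forward_with_tokens as forward() so it can be traced."""
    def __init__(self, model: KModel):
        super().__init__()
        self.model = model

    def forward(self, input_ids, ref_s, speed):
        return self.model.forward_with_tokens(input_ids, ref_s, speed)

# disable_complex=True avoids complex-valued STFT ops that exporters dislike.
torch_model = ExportableKModel(KModel(disable_complex=True).eval())

# Illustrative shapes: a short token sequence and a 256-dim style vector.
example_input = (
    torch.zeros((1, 32), dtype=torch.long),     # input_ids
    torch.zeros((1, 256), dtype=torch.float32), # ref_s
    torch.tensor(1.0),                          # speed
)

ov_model = ov.convert_model(torch_model, example_input=example_input)

# compress_to_fp16=True stores the weights in FP16 (weight-only compression);
# runtime compute precision is still chosen by the target device.
ov.save_model(ov_model, "openvino_model.xml", compress_to_fp16=True)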

Kokoro-82M-FP16-OpenVINO only works on CPU due to an unsupported kernel operation, ScatterNDUpdate, which requires INT64.

Enabling GPU compilation (at least on the A770) might require exporting a custom openvino_model.xml with kernel logic that works around this operation.

After hacking on this unsuccessfully for a few hours, I decided not to worry about these details and to use the GPU for other tasks.

CPU performance with Kokoro is very fast.
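
If you want to verify this on your own hardware, a quick sketch is to enumerate the available devices and attempt compilation on each (the model path below is a placeholder):

import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)  # e.g. ['CPU', 'GPU']

for device in core.available_devices:
    try:
        core.compile_model("path/to/openvino_model.xml", device)
        print(f"{device}: compiled OK")
    except Exception as e:
        # On Arc GPUs this is where the INT64 ScatterNDUpdate error surfaces.
        print(f"{device}: compilation failed: {e}")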

Usage

To run Kokoro with OpenVINO, you can:

  • Use the notebook from OpenVINO Notebooks, or
  • Use the provided snippet below, adapted from the notebook.

Required dependencies:

pip install kokoro "misaki[en]" openvino torch soundfile

The example below runs a single forward pass and times both model compilation and inference.

We are using OpenVINO directly here, so most of the convenience that wrapper libraries built on OpenVINO provide is excluded.

import json
import time
from pathlib import Path
import soundfile as sf

import torch
import openvino as ov
from kokoro.model import KModel
from kokoro.pipeline import KPipeline

class OVKModel(KModel):
    def __init__(self, model_dir: Path, device: str):
        # Base KModel init keeps this object compatible with KPipeline;
        # inference is delegated to the compiled OpenVINO model below.
        super().__init__()

        self.model_dir = Path(model_dir)

        # Load config.json
        with (self.model_dir / "config.json").open("r", encoding="utf-8") as f:
            config = json.load(f)

        self.vocab = config["vocab"]
        self.context_length = config["plbert"]["max_position_embeddings"]

        # Compile OpenVINO model
        start = time.perf_counter()
        core = ov.Core()
        self.model = core.compile_model(self.model_dir / "openvino_model.xml", device)
        self.compile_time_s = time.perf_counter() - start

    def forward_with_tokens(
        self,
        input_ids: torch.LongTensor,
        ref_s: torch.FloatTensor,
        speed: float = 1.0
    ) -> tuple[torch.FloatTensor, torch.LongTensor]:
        # The compiled model returns numpy arrays: (audio, predicted durations).
        outputs = self.model([input_ids, ref_s, torch.tensor(speed)])
        return torch.from_numpy(outputs[0]), torch.from_numpy(outputs[1])


def time_inference(fn, *args, **kwargs):
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, (time.perf_counter() - start)

if __name__ == "__main__":
    model_path = Path("path/to/model")

    # Initialize model + pipeline
    ov_model = OVKModel(model_path, device="CPU")
    print(f"Compile time: {ov_model.compile_time_s * 1000:.2f} ms")
    pipeline = KPipeline(model=ov_model, lang_code="a")  # "a" = American English

    input_text = (
        """
        Kokoro doesn't do well with emotional language, so if you use it with a text model,
        include something in the LLM instructions to guide its language to be "emotionally neutral"
        before piping to Kokoro. Actually, I have no idea if that's true, or even works, but I needed
        some text for this example.
        """
    )

    with torch.no_grad():
        generator = pipeline(input_text, voice="af_heart")
        result, elapsed_s = time_inference(next, generator)

    print(f"Generated audio with {len(result.audio)} samples at 24kHz")
    print(f"Inference time: {elapsed_s * 1000:.2f} ms")
    
    # Save as WAV file
    output_path = "kokoro_output.wav"
    sf.write(output_path, result.audio, 24000)  # 24kHz sample rate
    print(f"Audio saved to: {output_path}")