moonshotai/Kimi-Audio-7B-Instruct · Issue with long audio (~1 min) output, or prompt instruct following

messages = [

    {
        "role": "user",
        "message_type": "audio",
        "content": "500ms-silence.mp3", # an audio file input is required in your demo code
    },
        {
        "role": "user",
        "message_type": "text",
        "content": "Text to speech\n\nDelivery: Exaggerated and theatrical, with dramatic pauses, sudden outbursts, and gleeful cackling.\n\nVoice: High-energy, eccentric, and slightly unhinged, with a manic enthusiasm that rises and falls unpredictably.\n\nTone: Excited, chaotic, and grandiose, as if reveling in the brilliance of a mad experiment.\n\nPronunciation: Sharp and expressive, with elongated vowels, sudden inflections, and an emphasis on big words to sound more diabolical.\n\nText:\nAh-ha-ha! The stars tremble before my genius! The rift is open, the energy surging—unstable? Perhaps. Dangerous? Most certainly! Captain Rylen's hands twitch over the controls. Fools! They hesitate, but I—I alone see the future! \"Engage the thrusters!\" I bellow, eyes wild with possibility. The ship lurches, metal groaning—oh, what delicious chaos! Light bends, time twists, and then—BOOM! Silence. Darkness. And then… oh-ho! A new universe! Bigger! Stranger! And mine for the taking! Ah-ha-ha-ha!",
    }
]

wav, text = model.generate(messages, **sampling_params, output_type="both")
sf.write(
    os.path.join(output_dir, "output.wav"),
    wav.detach().cpu().view(-1).numpy(),
    24000,
)
print(">>> output text: ", text)

The prompt, copied from openai.fm, works correctly with qwen-omni (although style control is ineffective with it) and also with a glm-voice model fine-tuned for long audio outputs.

However, your model fails to generate from the beginning of the text. Instead, it produces seemingly random sentences and disregards the specified style controls. This suggests the model may either be incapable of handling long audio (30s~1min) output or struggle with following prompt instructions.

Kimi:

Qwen-omni:

OpenAI TTS:

Model trained in the same arch as glm-voice: