Issue with long audio (~1 min) output, or with prompt instruction following

#2
by JosephusCheung - opened
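import os

import soundfile as sf

# `model`, `sampling_params`, and `output_dir` are set up as in your demo code.
messages = [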
    {
        "role": "user",
        "message_type": "audio",
        "content": "500ms-silence.mp3", # an audio file input is required in your demo code
    },
    {
        "role": "user",
        "message_type": "text",
        "content": "Text to speech\n\nDelivery: Exaggerated and theatrical, with dramatic pauses, sudden outbursts, and gleeful cackling.\n\nVoice: High-energy, eccentric, and slightly unhinged, with a manic enthusiasm that rises and falls unpredictably.\n\nTone: Excited, chaotic, and grandiose, as if reveling in the brilliance of a mad experiment.\n\nPronunciation: Sharp and expressive, with elongated vowels, sudden inflections, and an emphasis on big words to sound more diabolical.\n\nText:\nAh-ha-ha! The stars tremble before my genius! The rift is open, the energy surging—unstable? Perhaps. Dangerous? Most certainly! Captain Rylen's hands twitch over the controls. Fools! They hesitate, but I—I alone see the future! \"Engage the thrusters!\" I bellow, eyes wild with possibility. The ship lurches, metal groaning—oh, what delicious chaos! Light bends, time twists, and then—BOOM! Silence. Darkness. And then… oh-ho! A new universe! Bigger! Stranger! And mine for the taking! Ah-ha-ha-ha!",
    }
]

wav, text = model.generate(messages, **sampling_params, output_type="both")
sf.write(
    os.path.join(output_dir, "output.wav"),
    wav.detach().cpu().view(-1).numpy(),
    24000,
)
print(">>> output text: ", text)

The prompt, copied from openai.fm, works correctly with qwen-omni (although style control is ineffective with it) and also with a glm-voice model fine-tuned for long audio outputs.

However, your model does not start from the beginning of the text. Instead, it produces seemingly random sentences and ignores the specified style controls. This suggests the model either cannot handle long audio output (30 s to 1 min) or struggles to follow prompt instructions.
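To make the failure above easier to compare across models, here is a minimal sketch of how one might quantify it. It is not part of the demo code: the helper names (`prompt_coverage`, `starts_at_beginning`, `duration_seconds`) are hypothetical, and it assumes the `text` returned by `model.generate` is a transcript of the generated audio. The 24000 Hz default matches the sample rate passed to `sf.write` in the snippet.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def prompt_coverage(target: str, output: str) -> float:
    """Fraction of the target words that appear, in order, in the output text."""
    target_words = normalize(target)
    output_words = normalize(output)
    if not target_words:
        return 0.0
    matcher = difflib.SequenceMatcher(a=target_words, b=output_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(target_words)


def starts_at_beginning(target: str, output: str, n_words: int = 5) -> bool:
    """True if the output opens with the same first words as the target."""
    target_words = normalize(target)
    n = min(n_words, len(target_words))
    if n == 0:
        return False
    return normalize(output)[:n] == target_words[:n]


def duration_seconds(num_samples: int, sample_rate: int = 24000) -> float:
    """Audio length in seconds for a mono waveform with num_samples samples."""
    return num_samples / sample_rate
```

With the snippet above, one could call `prompt_coverage(script, text)` and `starts_at_beginning(script, text)`, where `script` is the part of the prompt after `Text:`, and `duration_seconds(wav.numel())` to see whether the audio is anywhere near the expected 30 s to 1 min. A low coverage together with a short duration would point toward truncated output, while a low coverage at full duration would point toward the model not following the prompt.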

Kimi:

Qwen-omni:

OpenAI TTS:

Model trained with the same architecture as glm-voice:

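Here is one more bad case, this time in Chinese. Many underperforming models that similarly support speech (Whisper) input show this problem. It does not appear to be purely an LLM hallucination issue.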

[Two attached screenshots of the bad case: image.png, image.png]

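The speech output also appears to contain wrong characters and mispronunciations. I am not sure whether this is a problem with inference or with the model itself. I lean toward attributing it to detail-level issues in the audio head; generation problems also show up when rare words are encountered in the upstream ASR task.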
