Experience with Phi-4-Multimodal vs. Whisper-1 for Speech-to-Text
I tested phi-4-multimodal on Azure's hosted option (services.ai.azure.com) for speech-to-text and compared it to whisper-1. My results were disappointing—Whisper performed much better.
Setup:
- Model: phi-4-multimodal (Azure-hosted)
- Audio: 2-minute chunks of German speech
- Prompt (from the model's code sample):
  "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
- Implementation details: StackOverflow post (a rough sketch of the request is also included below)
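For reference, each request I send looks roughly like this. This is only a minimal sketch, not my exact code: the endpoint path, auth header, and the audio content-part shape are assumptions based on OpenAI-style chat payloads, so check the model's own code sample for the exact field names.

```python
import base64
import requests

# Sketch of one chat-completions request per 2-minute German chunk.
# NOTE: endpoint path, auth header, and the "input_audio" content part
# are assumptions; the Azure-hosted sample may use different field names.
ENDPOINT = "https://<resource>.services.ai.azure.com/models/chat/completions"
API_KEY = "<api-key>"

with open("german_chunk_01.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "Phi-4-multimodal-instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Based on the attached audio, generate a comprehensive "
                     "text transcription of the spoken content."},
            # Assumed shape for inline audio (base64-encoded WAV):
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
}

resp = requests.post(ENDPOINT, json=payload, headers={"api-key": API_KEY})
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```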
Question:
The paper was very promising in terms of transcription performance, and I’d love to see that in practice.
How can I achieve better transcription accuracy with phi-4-multimodal?
Are there preprocessing tricks or settings that improve performance?
Would love to hear from anyone who has made this work well!
Hi @hdevio ,
Thanks for the interest in Phi4-MultiModal. The best general prompt for the speech-to-text task is "Transcribe the audio clip into text."
Beyond the prompt, the other thing I can think of is the audio length limit for the ASR task. The maximum length for ASR is ideally 40 s, so you need to apply VAD or hard segmentation to the audio input yourself, similar to how the Whisper model is evaluated. For the summarization prompt, the audio length can be up to 30 minutes.
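If it helps, here is a minimal sketch of VAD-based segmentation using Silero VAD (an example choice of VAD, not something Phi4-MM requires) that packs detected speech into chunks of at most ~40 s before transcription. `transcribe_chunk` at the end is a hypothetical placeholder for whatever Phi-4-MM call you already make (hosted or local).

```python
import torch

SAMPLE_RATE = 16000
MAX_SAMPLES = 40 * SAMPLE_RATE  # ~40 s per ASR request

# Silero VAD via torch.hub (example VAD; any VAD works).
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("german_speech.wav", sampling_rate=SAMPLE_RATE)
speech_ts = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)

# Pack consecutive speech segments into windows of at most MAX_SAMPLES.
# (A single segment longer than 40 s would still need a hard split.)
chunks, start, end = [], None, None
for ts in speech_ts:
    if start is None:
        start, end = ts["start"], ts["end"]
    elif ts["end"] - start <= MAX_SAMPLES:
        end = ts["end"]
    else:
        chunks.append(wav[start:end])
        start, end = ts["start"], ts["end"]
if start is not None:
    chunks.append(wav[start:end])

# For each chunk, call Phi-4-MM with the recommended prompt, e.g.:
#   PROMPT = "Transcribe the audio clip into text."
#   text = " ".join(transcribe_chunk(c.numpy(), SAMPLE_RATE, PROMPT) for c in chunks)
# (transcribe_chunk is a placeholder for your own inference call.)
```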
Hope you can fix the issue soon!
Thanks!
@nguyenbh
First up, congratulations on releasing such a great model.
Hey, a lot of folks are looking at Phi-4 multimodal as a Whisper replacement.
Could you please share some resources on how to do transcription of large audio files with translation and timestamp decoding, similar to Whisper?
Also, could you show us how to do streaming decoding of audio files?
Glad to hear that folks are trying to use Phi4-multimodal. Unfortunately, we didn't add timestamps during Phi4-MM training, so timestamp decoding is not supported unless the model is further fine-tuned with your own data.
For large audio files and streaming decoding, you can refer to the Whisper solutions. For example:
For large audio files, the transformers ASR pipeline supports hard segmentation:
https://github.com/huggingface/transformers/blob/c1b9a11dd4be8af32b3274be7c9774d5a917c56d/src/transformers/pipelines/automatic_speech_recognition.py#L132
Using VAD can achieve better performance than hard segmentation though.
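For reference, this is how the chunked (hard-segmentation) path is typically used with Whisper itself. It only illustrates the approach described in the link above, since Phi4-MM is not a drop-in model for this pipeline; the audio file name is a placeholder.

```python
from transformers import pipeline

# Whisper long-form transcription with hard segmentation:
# chunk_length_s splits the audio into overlapping windows internally.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
)

result = asr("long_german_audio.wav", return_timestamps=True)
print(result["text"])
for seg in result["chunks"]:
    print(seg["timestamp"], seg["text"])
```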
For streaming decoding, Phi4-MM doesn't natively support it. You can try incremental decoding, similar to what was done with Whisper:
https://github.com/openai/whisper/discussions/2
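A rough sketch of that idea, assuming you already have a single-shot transcription function: the "local agreement" commit policy below comes from community streaming wrappers around Whisper, not from anything built into Phi4-MM, and `transcribe_fn` is a hypothetical placeholder for your own call.

```python
from typing import Callable

import numpy as np

SAMPLE_RATE = 16000

class IncrementalTranscriber:
    """Re-decode a growing audio buffer and commit only the text that two
    consecutive decoding passes agree on (local agreement)."""

    def __init__(self, transcribe_fn: Callable[[np.ndarray, int], str]):
        # transcribe_fn is a placeholder for your existing single-clip
        # Phi-4-MM (or Whisper) transcription call.
        self.transcribe_fn = transcribe_fn
        self.buffer = np.zeros(0, dtype=np.float32)
        self.prev_words: list[str] = []

    def feed(self, samples: np.ndarray) -> str:
        """Append newly captured samples and return the committed text."""
        self.buffer = np.concatenate([self.buffer, samples])
        words = self.transcribe_fn(self.buffer, SAMPLE_RATE).split()
        # Commit the longest common prefix of the last two hypotheses;
        # that prefix is unlikely to change on later passes.
        agreed = 0
        for a, b in zip(self.prev_words, words):
            if a != b:
                break
            agreed += 1
        self.prev_words = words
        return " ".join(words[:agreed])
```

In practice you would also drop committed audio from the front of the buffer so each pass stays under the ~40 s ASR limit mentioned above.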
Thanks!
@fanruchao thanks for the quick response. I really appreciate it.
Looking forward to building and seeing great ASR models with Phi-4 MM.
Cheers and happy building.