Experience with Phi-4-Multimodal vs. Whisper-1 for Speech-to-Text
I tested phi-4-multimodal on Azure's hosted option (services.ai.azure.com) for speech-to-text and compared it to whisper-1. My results were disappointing—Whisper performed much better.
Setup:
- Model: phi-4-multimodal (Azure-hosted)
- Audio: 2-minute chunks of German speech
- Prompt (from the model's code sample):
  "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
- Implementation details: StackOverflow post (a rough sketch of the request is also included below)
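For reference, each request I send looks roughly like this. This is only a minimal sketch, not my exact code: the endpoint path, auth header, and the audio content-part shape are assumptions based on OpenAI-style chat payloads, so check the model's own code sample for the exact field names.

```python
import base64
import requests

# Sketch of one chat-completions request per 2-minute German chunk.
# NOTE: endpoint path, auth header, and the "input_audio" content part
# are assumptions; the Azure-hosted sample may use different field names.
ENDPOINT = "https://<resource>.services.ai.azure.com/models/chat/completions"
API_KEY = "<api-key>"

with open("german_chunk_01.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "Phi-4-multimodal-instruct",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Based on the attached audio, generate a comprehensive "
                     "text transcription of the spoken content."},
            # Assumed shape for inline audio (base64-encoded WAV):
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
}

resp = requests.post(ENDPOINT, json=payload, headers={"api-key": API_KEY})
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```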
Question:
The paper was very promising in terms of transcription performance, and I’d love to see that in practice.
How can I achieve better transcription accuracy with phi-4-multimodal?
Are there preprocessing tricks or settings that improve performance?
Would love to hear from anyone who has made this work well!
Hi @hdevio ,
Thanks for the interest in Phi4-MultiModal. The best general prompt for the speech-to-text task is "Transcribe the audio clip into text."
Beyond the prompt, the other thing I can think of is the audio length limit for the ASR task. The maximum length for ASR is ideally 40 s, so you need to apply VAD or hard segmentation to the audio input yourself, similar to how the Whisper model is evaluated. For the summarization prompt, the audio length can be up to 30 minutes.
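If it helps, here is a minimal sketch of VAD-based segmentation using Silero VAD (an example choice of VAD, not something Phi4-MM requires) that packs detected speech into chunks of at most ~40 s before transcription. `transcribe_chunk` at the end is a hypothetical placeholder for whatever Phi-4-MM call you already make (hosted or local).

```python
import torch

SAMPLE_RATE = 16000
MAX_SAMPLES = 40 * SAMPLE_RATE  # ~40 s per ASR request

# Silero VAD via torch.hub (example VAD; any VAD works).
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("german_speech.wav", sampling_rate=SAMPLE_RATE)
speech_ts = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLE_RATE)

# Pack consecutive speech segments into windows of at most MAX_SAMPLES.
# (A single segment longer than 40 s would still need a hard split.)
chunks, start, end = [], None, None
for ts in speech_ts:
    if start is None:
        start, end = ts["start"], ts["end"]
    elif ts["end"] - start <= MAX_SAMPLES:
        end = ts["end"]
    else:
        chunks.append(wav[start:end])
        start, end = ts["start"], ts["end"]
if start is not None:
    chunks.append(wav[start:end])

# For each chunk, call Phi-4-MM with the recommended prompt, e.g.:
#   PROMPT = "Transcribe the audio clip into text."
#   text = " ".join(transcribe_chunk(c.numpy(), SAMPLE_RATE, PROMPT) for c in chunks)
# (transcribe_chunk is a placeholder for your own inference call.)
```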
Hope you can fix the issue soon!
Thanks!
@nguyenbh
First up, congratulations on releasing such a great model.
Hey, a lot of folks are looking at Phi-4 multimodal as a Whisper replacement.
Could you please share some resources on how to do transcription of large audio files with translation and timestamp decoding, similar to Whisper?
Also, could you show us how to do streaming decoding of audio files?
Glad to hear that folks are trying to use Phi4-multimodal. Unfortunately, we didn't add timestamps during Phi4-MM training, so timestamp decoding is not supported unless the model is further fine-tuned with your own data.
For large audio files and streaming decoding, you can refer to the Whisper solutions. For example:
For large audio files, the transformers ASR pipeline supports hard segmentation:
https://github.com/huggingface/transformers/blob/c1b9a11dd4be8af32b3274be7c9774d5a917c56d/src/transformers/pipelines/automatic_speech_recognition.py#L132
Using VAD can achieve better performance than hard segmentation though.
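For reference, this is how the chunked (hard-segmentation) path is typically used with Whisper itself. It only illustrates the approach described in the link above, since Phi4-MM is not a drop-in model for this pipeline; the audio file name is a placeholder.

```python
from transformers import pipeline

# Whisper long-form transcription with hard segmentation:
# chunk_length_s splits the audio into overlapping windows internally.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
)

result = asr("long_german_audio.wav", return_timestamps=True)
print(result["text"])
for seg in result["chunks"]:
    print(seg["timestamp"], seg["text"])
```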
For streaming decoding, Phi4-MM doesn't natively support it. You can try incremental decoding, similar to what was done with Whisper:
https://github.com/openai/whisper/discussions/2
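A rough sketch of that idea, assuming you already have a single-shot transcription function: the "local agreement" commit policy below comes from community streaming wrappers around Whisper, not from anything built into Phi4-MM, and `transcribe_fn` is a hypothetical placeholder for your own call.

```python
from typing import Callable

import numpy as np

SAMPLE_RATE = 16000

class IncrementalTranscriber:
    """Re-decode a growing audio buffer and commit only the text that two
    consecutive decoding passes agree on (local agreement)."""

    def __init__(self, transcribe_fn: Callable[[np.ndarray, int], str]):
        # transcribe_fn is a placeholder for your existing single-clip
        # Phi-4-MM (or Whisper) transcription call.
        self.transcribe_fn = transcribe_fn
        self.buffer = np.zeros(0, dtype=np.float32)
        self.prev_words: list[str] = []

    def feed(self, samples: np.ndarray) -> str:
        """Append newly captured samples and return the committed text."""
        self.buffer = np.concatenate([self.buffer, samples])
        words = self.transcribe_fn(self.buffer, SAMPLE_RATE).split()
        # Commit the longest common prefix of the last two hypotheses;
        # that prefix is unlikely to change on later passes.
        agreed = 0
        for a, b in zip(self.prev_words, words):
            if a != b:
                break
            agreed += 1
        self.prev_words = words
        return " ".join(words[:agreed])
```

In practice you would also drop committed audio from the front of the buffer so each pass stays under the ~40 s ASR limit mentioned above.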
Thanks!
@fanruchao thanks for the quick response. I really appreciate it.
Looking forward to building and seeing great ASR models with Phi-4 MM.
Cheers and happy building.