How to use the AFWhisper to extract audio features

#2
by GuoGuoBan - opened

Thank you for your contribution. But how can I use the AFWhisper to extract audio features?

You can load the weights using Whisper and do exactly the same steps you do with Whisper.

I am using the transformers package to load the Whisper model weights, as shown at https://huggingface.co/openai/whisper-large-v3. Could you please provide an example of how to load your model weights? Thank you very much!

I believe it is something like this:

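import librosa
import torch
from transformers import WhisperProcessor
from transformers.models.whisper.modeling_whisper import WhisperEncoder

# Load the AF-Whisper encoder weights from the "sound_tower" subfolder
encoder = WhisperEncoder.from_pretrained("nvidia/audio-flamingo-3", subfolder="sound_tower")
encoder.eval()
# Use the Whisper large-v3 processor to compute log-mel input features
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

# Resample to 16 kHz on load, the rate the Whisper processor expects
audio_array, sampling_rate = librosa.load("test.mp3", sr=16000)
inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    encoder_outputs = encoder(inputs.input_features)

print(f"Encoder output shape: {encoder_outputs.last_hidden_state.shape}")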
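
As a rough sanity check on the printed shape, here is a small sketch of how many time steps to expect. It assumes the sound tower keeps the standard Whisper front end (16 kHz audio padded to 30 s, a hop length of 160 samples, and a second convolution with stride 2); I have not verified that against the model repo, and the function name and defaults are only for illustration:

```python
def expected_encoder_frames(clip_seconds=30, sample_rate=16000,
                            hop_length=160, conv_stride=2):
    """Estimate the encoder sequence length for one padded clip.

    All defaults are assumptions taken from the standard Whisper setup,
    not values read from the AF-Whisper checkpoint.
    """
    n_samples = clip_seconds * sample_rate      # samples after padding
    n_mel_frames = n_samples // hop_length      # log-mel frames
    return n_mel_frames // conv_stride          # frames after the strided conv


print(expected_encoder_frames())
```

If those assumptions hold, a single clip should give a last_hidden_state of shape (1, 1500, hidden_size), where hidden_size can be read from encoder.config.d_model.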

It works for me. Thank you!
