gemma3n-audio-encoder-whisper-decoder

Combines the mesolitica/gemma-3n-e4b-it-audio-encoder encoder and projection layer with the openai/whisper-large-v3-turbo decoder.
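The wiring can be sketched as: audio encoder frames pass through a learned linear projection so the Whisper decoder can cross-attend to them. A minimal NumPy sketch, where the dimensions are illustrative assumptions rather than the real model config:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the actual checkpoint config):
# d_enc = audio-encoder frame width, d_dec = Whisper decoder width, T = frames.
d_enc, d_dec, T = 1536, 1280, 10

encoder_frames = rng.standard_normal((T, d_enc))  # audio encoder output
W = rng.standard_normal((d_enc, d_dec)) * 0.02    # learned projection weights

# The projected frames are what the Whisper decoder cross-attends to.
projected = encoder_frames @ W
```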

This model will later be used to introduce VQ (vector quantization) for the projection layer.
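For intuition, vector quantization maps each continuous projection output to its nearest entry in a learned codebook. This is a generic nearest-codebook sketch in NumPy, not the actual projection-layer implementation:

```python
import numpy as np

def vq_quantize(x, codebook):
    """Map each vector in x to its nearest codebook entry (L2 distance).

    x: (n, d) array of projection outputs; codebook: (k, d) array.
    Returns (indices, quantized) where quantized[i] = codebook[indices[i]].
    """
    # Pairwise squared L2 distances between inputs and codebook entries.
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
x = np.array([[0.1, -0.1], [0.9, 1.2]])
idx, q = vq_quantize(x, codebook)
```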

WandB logs at https://wandb.ai/huseinzol05/gemma3n-audio-whisper-decoder-v2

Training datasets

  1. malaysia-ai/common_voice_17_0
  2. mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_segments
  3. mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_manglish_segments

How to use

from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
import librosa

model_id = "mesolitica/gemma3n-audio-encoder-whisper-decoder"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code = True, torch_dtype = 'auto').cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load audio at the feature extractor's expected sampling rate
y, sr = librosa.load('common_voice_ba_26517811.mp3', sr = feature_extractor.sampling_rate)
# Whisper-style decoder prompt: language tag + task + no timestamps
input_ids = tokenizer(
    '<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>', 
    add_special_tokens = False, return_tensors = 'pt')['input_ids']
features = feature_extractor([y], return_tensors = 'pt')
features['input_features'] = features['input_features'].cuda()
features['input_features_mask'] = features['input_features_mask'].cuda()
features['attention_mask'] = features['input_features_mask']
features['decoder_input_ids'] = input_ids.cuda()

generate_kwargs = dict(
    **features,
    max_new_tokens=1024,
    temperature=0.1,
    do_sample=True
)
generation_output = model.generate(**generate_kwargs)
print(tokenizer.decode(generation_output[0]))

Output,

<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Кубы сыраохта был халя гешенең битарафлыгы сәпәпсем.<|endoftext|>
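The decoded string above still contains Whisper's control tokens. Passing `skip_special_tokens=True` to `tokenizer.decode` removes them; an equivalent standalone regex strip, using the output shown above:

```python
import re

raw = ('<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>'
       ' Кубы сыраохта был халя гешенең битарафлыгы сәпәпсем.<|endoftext|>')

# Remove every <|...|> control token, then trim surrounding whitespace.
text = re.sub(r'<\|[^|]*\|>', '', raw).strip()
print(text)
```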

Evaluation

Evaluated on malaysia-ai/common_voice_17_0/test with the following conditions,

  1. Lower case.
  2. Remove punctuation.
  3. Provide a language tag in the decoder input ids: <|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>.
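The normalization conditions above can be sketched as plain Python helpers; the exact normalization used for the reported evaluation may differ (this version only strips ASCII punctuation):

```python
import string

def normalize(text):
    # Conditions 1 and 2: lower case, then remove (ASCII) punctuation.
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation)).strip()

def decoder_prompt(lang):
    # Condition 3: language-tagged decoder prefix, e.g. lang = 'ms'.
    return f'<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>'

print(normalize('Hello, World!'))
```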

Source code

Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/gemma3n-audio-whisper-decoder

Model size: 859M parameters (Safetensors, F32).
