This model combines the mesolitica/gemma-3n-e4b-it-audio-encoder encoder and projection layers with the openai/whisper-large-v3-turbo decoder.

The model will later be used to introduce VQ (vector quantization) on the projection layer.

WandB logs at https://wandb.ai/huseinzol05/gemma3n-audio-whisper-decoder-v2
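Since the stated goal is to add VQ on the projection layer, here is a minimal sketch of what vector quantization of projected encoder frames looks like. This is illustrative only: the codebook size, dimensions, and NumPy implementation are assumptions, not the actual training code.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each D-dim vector in z (N, D) to its nearest entry in codebook (K, D).

    Returns (indices, quantized): the discrete token ids and the
    quantized vectors, as in a VQ-VAE style bottleneck.
    """
    # Squared L2 distance between every frame and every codebook entry -> (N, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)   # discrete speech tokens
    quantized = codebook[indices]    # replace each frame with its nearest code
    return indices, quantized

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))  # K=1024 codes, D=64 dims (illustrative)
z = rng.normal(size=(10, 64))           # 10 projected encoder frames
ids, zq = vector_quantize(z, codebook)
print(ids.shape, zq.shape)              # (10,) (10, 64)
```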
```python
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
import librosa

model_id = "mesolitica/gemma3n-audio-encoder-whisper-decoder"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype='auto').cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load audio at the feature extractor's expected sampling rate
y, sr = librosa.load('common_voice_ba_26517811.mp3', sr=feature_extractor.sampling_rate)

# Whisper-style decoder prompt: language + task tokens
input_ids = tokenizer(
    '<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>',
    add_special_tokens=False, return_tensors='pt')['input_ids']

features = feature_extractor([y], return_tensors='pt')
features['input_features'] = features['input_features'].cuda()
features['input_features_mask'] = features['input_features_mask'].cuda()
features['attention_mask'] = features['input_features_mask']
features['decoder_input_ids'] = input_ids.cuda()

generate_kwargs = dict(
    **features,
    max_new_tokens=1024,
    temperature=0.1,
    do_sample=True,
)
generation_output = model.generate(**generate_kwargs)
print(tokenizer.decode(generation_output[0]))
```
Output,

```
<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Кубы сыраохта был халя гешенең битарафлыгы сәпәпсем.<|endoftext|>
```
Evaluated on the malaysia-ai/common_voice_17_0 test split, using the decoder prompt `<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>` for each language.
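The evaluation script itself lives in the linked repository; as a self-contained illustration of the metric, word error rate can be computed from reference/hypothesis pairs like this (pure-Python sketch, not the repo's implementation):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

# One substitution out of four reference words -> 0.25
print(wer('кубы сыра был халя', 'кубы сыраохта был халя'))
```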
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/gemma3n-audio-whisper-decoder