SpeechLLM
SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. The speechllm-2B model is based on the HubertX audio encoder and the TinyLlama LLM. The model predicts the following: SpeechActivity, Transcript, Gender, Emotion, Age, and Accent.
```python
# Load the model directly from the Hugging Face Hub
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",                        # 16 kHz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],  # [Optional] pass either audio_path or the waveform tensor directly
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Example model output
'''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent" : "America",
}
'''
```
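The snippet above assumes the file is already 16 kHz mono and shows the raw generated text. The sketch below is not from the model card: it assumes `generate_meta` returns the metadata string in the format shown above and accepts a `[channels, samples]` waveform via `audio_tensor`; under those assumptions it resamples arbitrary input audio with torchaudio and parses the result into a Python dict.

```python
# Hedged usage sketch: resample to 16 kHz mono, run the model, parse the output.
# Assumes generate_meta returns the dict-like string shown above.
import ast

import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

waveform, sample_rate = torchaudio.load("any-audio.wav")  # waveform: [channels, samples]
waveform = waveform.mean(dim=0, keepdim=True)             # downmix to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

raw_output = model.generate_meta(
    audio_tensor=waveform,  # assumed alternative to audio_path, per the comment above
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False,
)

# The example output is a Python-style dict literal (note the trailing comma),
# so ast.literal_eval is more forgiving than json.loads here.
metadata = ast.literal_eval(raw_output)
print(metadata["Transcript"], "|", metadata["Emotion"])
```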
Try the model in the Google Colab Notebook. Also, check out our blog on SpeechLLM for end-to-end conversational agents (User Speech -> Response).
| Dataset | Type | Word Error Rate (%) | Gender Acc | Age Acc | Accent Acc |
|---|---|---|---|---|---|
| librispeech-test-clean | Read Speech | 11.51 | 0.9594 | | |
| librispeech-test-other | Read Speech | 16.68 | 0.9297 | | |
| CommonVoice test | Diverse Accent, Age | 26.02 | 0.9476 | 0.6498 | 0.8121 |
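For context on the metrics: Word Error Rate is presumably computed over the predicted Transcript field, and the Acc columns are plain classification accuracy over the predicted labels. The sketch below is illustrative only (it is not the evaluation script behind these numbers) and uses the jiwer library with placeholder reference/prediction lists:

```python
# Illustrative metric computation with placeholder data; not the authors' evaluation code.
import jiwer

# Placeholder ground-truth and predicted transcripts.
reference_transcripts = ["yes i got it i'll make the payment now"]
predicted_transcripts = ["yes i got it i will make the payment now"]
wer = jiwer.wer(reference_transcripts, predicted_transcripts)
print(f"Word Error Rate: {wer * 100:.2f}%")

# Placeholder ground-truth and predicted gender labels.
reference_genders = ["Female", "Male", "Female"]
predicted_genders = ["Female", "Male", "Male"]
gender_acc = sum(r == p for r, p in zip(reference_genders, predicted_genders)) / len(reference_genders)
print(f"Gender Acc: {gender_acc:.4f}")
```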
```bibtex
@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
  author = {Rajaa, Shangeth and Tushar, Abhinav},
  title  = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
  url    = {https://github.com/skit-ai/SpeechLLM}
}
```