SpeechLLM

SpeechLLM is a multi-modal LLM trained to predict the metadata of the speaker's turn in a conversation. The speechllm-1.5B model is based on the HubertX audio encoder and the TinyLlama LLM. The model predicts the following:

SpeechActivity: whether the audio signal contains speech (True/False)
Transcript: ASR transcript of the audio
Gender: gender of the speaker (Female/Male)
Age: age group of the speaker (Young/Middle-Age/Senior)
Accent: accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
Emotion: emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
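
The label sets above can be collected into a lookup table for sanity-checking predictions downstream. This is only an illustrative sketch: `ALLOWED_LABELS` and `invalid_fields` are hypothetical names that are not part of SpeechLLM, and it assumes a prediction is available as a dict of strings keyed by the field names listed above. Transcript is free text, so it is not checked.

```python
# Hypothetical helper, not part of SpeechLLM.
# Label sets are copied from the list of predicted fields above.
ALLOWED_LABELS = {
    "SpeechActivity": {"True", "False"},
    "Gender": {"Female", "Male"},
    "Age": {"Young", "Middle-Age", "Senior"},
    "Accent": {
        "Africa", "America", "Celtic", "Europe",
        "Oceania", "South-Asia", "South-East-Asia",
    },
    "Emotion": {"Happy", "Sad", "Anger", "Neutral", "Frustrated"},
}


def invalid_fields(prediction):
    """Return the sorted names of categorical fields whose value is
    outside the documented label set. Missing fields are ignored."""
    return sorted(
        field
        for field, allowed in ALLOWED_LABELS.items()
        if field in prediction and prediction[field] not in allowed
    )
```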

Usage

# Load model directly from huggingface
import torchaudio
from transformers import AutoModel
model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav", # 16 kHz, mono
    # torchaudio.load returns (waveform, sample_rate); index 0 is the waveform tensor
    audio_tensor=torchaudio.load("path-to-audio.wav")[0], # [Optional] either audio_path or audio_tensor directly
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}
'''
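
To use the generation programmatically, it can be turned into a Python dict. The sketch below assumes the model returns a string shaped like the example above, possibly with a trailing comma before the closing brace, which strict JSON rejects. `parse_meta` is a hypothetical helper, not part of the SpeechLLM API.

```python
import json
import re


def parse_meta(generation: str) -> dict:
    """Parse a JSON-like generation string into a dict.

    Assumption: the text contains one {...} object shaped like the
    example above. This is a hypothetical helper, not a SpeechLLM API.
    """
    # Keep only the outermost {...} span in case extra text surrounds it.
    start, end = generation.find("{"), generation.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in generation")
    body = generation[start:end + 1]
    # Drop a trailing comma before a closing brace. Caveat: this would also
    # alter a transcript that itself contains the characters ", }".
    body = re.sub(r",\s*}", "}", body)
    return json.loads(body)


raw = '''
{
  "SpeechActivity" : "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Accent" : "America",
}
'''
meta = parse_meta(raw)
print(meta["Transcript"])
```

Note that the values are strings, so `meta["SpeechActivity"] == "True"` is the comparison to use rather than a boolean test.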

Try the model in the Google Colab notebook. Also, check out our blog post on SpeechLLM for end-to-end conversational agents (User Speech -> Response).

Model Details

Developed by: Skit AI
Authors: Shangeth Rajaa, Abhinav Tushar
Language: English
Finetuned from model: WavLM, TinyLlama
Model Size: 1.5 B
Checkpoint: 1200k steps (batch size = 1)
Adapters: r=8, alpha=16
Learning rate: 1e-4
Gradient accumulation steps: 8
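
As a rough reading of these hyperparameters, the arithmetic below derives two quantities from them. It is a sketch under stated assumptions: the card does not say which adapter type was used, so the `alpha / r` scaling is the LoRA convention applied here as an assumption, and the variable names are illustrative.

```python
# Values taken from the model details above.
r, alpha = 8, 16
batch_size, grad_accum_steps = 1, 8

# Assumption: LoRA-style adapters, where the low-rank update is
# multiplied by alpha / r before being added to the frozen weights.
lora_scale = alpha / r

# Number of samples contributing to each optimizer update when
# gradients are accumulated over several forward/backward passes.
effective_batch_size = batch_size * grad_accum_steps

print(lora_scale, effective_batch_size)
```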

Checkpoint Result

Dataset	Type	Word Error Rate (%)	Gender Acc	Age Acc	Accent Acc
librispeech-test-clean	Read Speech	11.51	0.9594	-	-
librispeech-test-other	Read Speech	16.68	0.9297	-	-
CommonVoice test	Diverse Accent, Age	26.02	0.9476	0.6498	0.8121
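
For reference, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a minimal illustration of that definition. It is not the evaluation script behind the table, it applies no text normalization, and `word_error_rate` is a hypothetical name. It returns a fraction, whereas the table values appear to be percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis is i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("i will make the payment now", "i will make payment now")
print(f"{100 * wer:.2f}%")
```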

Cite

@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
author = {Rajaa, Shangeth and Tushar, Abhinav},
title = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
url = {https://github.com/skit-ai/SpeechLLM}
}
