# Model Card for vkao8264/mms-accent-predict

Classifies voice input into 11 English accents.
## Model Details
This model is a fine-tune of [facebook/mms-lid-256](https://huggingface.co/facebook/mms-lid-256) on the Speech Accent Archive dataset.

It classifies voice input into 11 English accents:
"0": "African"
"1": "Australian"
"2": "British"
"3": "EastAsian"
"4": "EasternEuropean"
"5": "LatinAmerican"
"6": "MiddleEastern"
"7": "NorthAmerican"
"8": "SouthAsian"
"9": "SouthEastAsian"
"10": "WesternEuropean"
## Uses
Because of the constraints of the dataset, the input audio should contain the following phrase for best prediction results:

> Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
### Direct Use
You can load the model using the ID `vkao8264/mms-accent-predict` with the Transformers package:
```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torchaudio
import torch

sample_rate = 16000    # MMS expects 16 kHz audio
max_audio_length = 15  # seconds; longer clips are truncated

id_to_class = {
    0: "African",
    1: "Australian",
    2: "British",
    3: "EastAsian",
    4: "EasternEuropean",
    5: "LatinAmerican",
    6: "MiddleEastern",
    7: "NorthAmerican",
    8: "SouthAsian",
    9: "SouthEastAsian",
    10: "WesternEuropean",
}

model = AutoModelForAudioClassification.from_pretrained("vkao8264/mms-accent-predict")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-256")

def load_and_preprocess_audio(path):
    waveform, sr = torchaudio.load(path)
    # Resample to 16 kHz: MMS is built on Wav2Vec2, which expects 16 kHz input
    if sr != sample_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)
    # Convert stereo to mono by averaging the channels
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Drop the channel dimension to get a 1D waveform
    waveform = waveform.squeeze(0)
    inputs = feature_extractor(
        waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding="max_length",
        max_length=sample_rate * max_audio_length,
        truncation=True,
    )
    return inputs.input_values

sample = "audio_input.mp3"
inputs = load_and_preprocess_audio(sample)
with torch.no_grad():
    predictions = model(inputs)
pred_label = torch.argmax(predictions.logits).item()
print(id_to_class[pred_label])
```
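To see how confident the prediction is rather than only the top label, you can turn the logits into probabilities with a softmax (a standard post-processing step, not something specific to this model). This continues from the snippet above:

```python
import torch.nn.functional as F

with torch.no_grad():
    logits = model(inputs).logits

# Convert logits to a probability distribution over the 11 accents.
probs = F.softmax(logits, dim=-1).squeeze(0)
# Print the top-3 accents with their probabilities.
for idx, p in sorted(enumerate(probs.tolist()), key=lambda x: -x[1])[:3]:
    print(f"{id_to_class[idx]}: {p:.3f}")
```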
## Training Details

### Training Data
The training data consists of about 2,000 unique audio samples from the Speech Accent Archive, downloaded from Kaggle. The data was further split into training and validation sets of 1,698 and 425 samples respectively.
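The card does not document how the split was produced; the sketch below shows one plausible way using scikit-learn's `train_test_split`. The CSV filename and the `accent_group` column are assumptions about the Kaggle export, not confirmed details:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# "speakers_all.csv" and the "accent_group" column are hypothetical names
# for the Kaggle metadata file and the derived 11-way accent label.
df = pd.read_csv("speakers_all.csv")
train_df, val_df = train_test_split(
    df,
    test_size=0.2,                # roughly matches the 1698/425 split
    stratify=df["accent_group"],  # keep class proportions in both sets
    random_state=42,
)
print(len(train_df), len(val_df))
```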
## Evaluation
F1 score on the validation set: 0.86
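For reference, a score like this can be computed with scikit-learn; the label lists and the `macro` averaging below are illustrative assumptions, since the card does not state which F1 variant was used:

```python
from sklearn.metrics import f1_score

# Toy stand-ins for labels collected during a validation loop.
y_true = [7, 2, 5, 7, 1]  # ground-truth accent ids
y_pred = [7, 2, 5, 3, 1]  # model predictions

# "macro" averaging is an assumption; the card does not specify the variant.
print(f1_score(y_true, y_pred, average="macro"))
```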