Model Card for mms-accent-predict

Classifies voice input into one of 11 English accents.

Model Details

This model is a fine-tune of facebook/mms-lid-256 on the Speech Accent Archive dataset.

It classifies speech into 11 English accent classes:
"0": "African"
"1": "Australian"
"2": "British"
"3": "EastAsian"
"4": "EasternEuropean"
"5": "LatinAmerican"
"6": "MiddleEastern"
"7": "NorthAmerican"
"8": "SouthAsian"
"9": "SouthEastAsian"
"10": "WesternEuropean"

Uses

Because of the constraints of the dataset, in which every speaker reads the same passage, the input audio should contain the following passage for best prediction results:

Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Direct Use

You can load the model with the Transformers package using the ID vkao8264/mms-accent-predict:

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torchaudio
import torch

def load_and_preprocess_audio(path):
    """Load an audio file and return model-ready input values (uses the globals defined below)."""
    waveform, sr = torchaudio.load(path)

    # Resample to 16 kHz, the sampling rate the Wav2Vec2-based MMS model expects
    if sr != sample_rate:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=sample_rate)(waveform)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Remove channel dimension and convert to 1D
    waveform = waveform.squeeze(0)

    inputs = feature_extractor(
        waveform,
        sampling_rate=sample_rate,
        return_tensors="pt",
        padding="max_length",
        max_length=sample_rate * max_audio_length,
        truncation=True
    )

    return inputs.input_values

id_to_class = {
  0: "African",
  1: "Australian",
  2: "British",
  3: "EastAsian",
  4: "EasternEuropean",
  5: "LatinAmerican",
  6: "MiddleEastern",
  7: "NorthAmerican",
  8: "SouthAsian",
  9: "SouthEastAsian",
  10: "WesternEuropean" 
}

sample_rate = 16000
max_audio_length = 15

model = AutoModelForAudioClassification.from_pretrained("vkao8264/mms-accent-predict")
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/mms-lid-256")

sample = "audio_input.mp3"
inputs = load_and_preprocess_audio(sample)

# Run inference without tracking gradients
with torch.no_grad():
    predictions = model(inputs)
pred_label = torch.argmax(predictions.logits, dim=-1).item()

print(id_to_class[pred_label])
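
The snippet above prints only the single most likely class. To see how the model's confidence is spread across accents, the logits can be passed through a softmax and ranked. The following is a minimal sketch: `top_k_accents` is a hypothetical helper, not part of Transformers, and it assumes the logits have first been converted to a plain Python list, for example with `predictions.logits[0].tolist()`.

```python
import math

def top_k_accents(logits, id_to_class, k=3):
    """Return the k most probable (label, probability) pairs.

    Hypothetical helper: `logits` is a flat list of floats, one per class.
    """
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_class[i], probs[i]) for i in ranked[:k]]
```

With the variables from the example above, a call might look like `top_k_accents(predictions.logits[0].tolist(), id_to_class)`. A low top probability can be a hint that the recording does not match the conditions the model was trained on.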

Training Details

Training Data

The full dataset consists of about 2,000 unique audio samples from the Speech Accent Archive, downloaded from Kaggle. The data is then split into training and validation sets of 1,698 and 425 samples, respectively.
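
The card gives the split sizes but not how the split was made. As an illustration only, a seeded random split with a 20% validation fraction over 1,698 + 425 = 2,123 items reproduces those sizes. The helper name `split_indices`, the seed, and the 0.2 fraction are assumptions here, not the author's documented procedure.

```python
import random

def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle indices reproducibly and split them into train and validation lists."""
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    # The first n_val shuffled indices form the validation set, the rest the training set.
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(1698 + 425)
print(len(train_idx), len(val_idx))  # 1698 425
```

A purely random split does not preserve per-accent proportions. A stratified split would, but the card does not say which approach was used.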

Evaluation

Result on the validation set: 0.86 (F1 score)
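
Accuracy and F1 can diverge when classes are imbalanced, so it helps to know how each is computed. The sketch below computes both from lists of integer labels. Macro averaging (an unweighted mean over classes) is an assumption here, since the card does not state which F1 averaging was used, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally in the macro average, a rare accent that the model handles poorly lowers macro F1 more than it lowers accuracy.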
