Audio Classification
Audio classification is the task of assigning a label or class to a given audio clip. It can be used to recognize which command a user is giving, detect the emotion of a statement, or identify a speaker.
About Audio Classification
Use Cases
Command Recognition
Command recognition or keyword spotting classifies utterances into a predefined set of commands. This is often done on-device for fast response time.
As an example, using the Google Speech Commands dataset, given an audio input, a model can classify which of the following commands the user is saying:
'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence'
SpeechBrain models can easily perform this task with just a couple of lines of code!
```python
from speechbrain.pretrained import EncoderClassifier

# Load a pretrained x-vector classifier for Google Speech Commands
model = EncoderClassifier.from_hparams(
    "speechbrain/google_speech_command_xvector"
)
# Returns the class probabilities, the best score, its index, and the predicted label
model.classify_file("file.wav")
```
Language Identification
Datasets such as VoxLingua107 allow anyone to train language identification models for up to 107 languages! This can be extremely useful as a preprocessing step for other systems. Here's an example model trained on VoxLingua107.
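As a minimal sketch, such a model can be loaded with SpeechBrain's EncoderClassifier; the checkpoint name and the local file "speech.wav" below are illustrative choices:

```python
# A minimal sketch, assuming the speechbrain/lang-id-voxlingua107-ecapa
# checkpoint on the Hub and a local file "speech.wav"
from speechbrain.pretrained import EncoderClassifier

lang_id = EncoderClassifier.from_hparams("speechbrain/lang-id-voxlingua107-ecapa")
out_prob, score, index, text_lab = lang_id.classify_file("speech.wav")
print(text_lab)  # predicted language label, e.g. an English label for English speech
```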
Emotion Recognition
Emotion recognition classifies an utterance by the emotion it conveys, such as happiness, anger, or sadness. In addition to trying the widgets, you can use Inference Endpoints to perform audio classification. Here is a simple example that uses a HuBERT model fine-tuned for this task.
```python
import json

import requests

API_TOKEN = "hf_..."  # your Hugging Face access token
API_URL = "https://api-inference.huggingface.co/models/superb/hubert-large-superb-er"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(filename):
    # Send the raw audio bytes to the Inference API and parse the JSON response
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query("sample1.flac")
# [{'label': 'neu', 'score': 0.60},
#  {'label': 'hap', 'score': 0.20},
#  {'label': 'ang', 'score': 0.13},
#  {'label': 'sad', 'score': 0.07}]
```
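If you prefer to run inference locally, a quick sketch with the transformers pipeline and the same checkpoint looks like this:

```python
# A local-inference sketch using the transformers pipeline with the
# same superb/hubert-large-superb-er checkpoint as above
from transformers import pipeline

classifier = pipeline("audio-classification", model="superb/hubert-large-superb-er")
classifier("sample1.flac")  # returns a list of {label, score} dicts
```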
You can also use huggingface.js to infer with audio classification models on the Hugging Face Hub.
```javascript
import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_TOKEN);
// Classify the spoken language of an audio file with a multilingual LID model
await inference.audioClassification({
  data: await (await fetch("sample.flac")).blob(),
  model: "facebook/mms-lid-126",
});
```
Speaker Identification
Speaker identification classifies an audio clip according to the identity of the person speaking, usually from a predefined set of speakers. You can try out this task with this model. A useful dataset for this task is VoxCeleb1.
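As an illustrative sketch, a SUPERB checkpoint fine-tuned for speaker identification on VoxCeleb1 can be used with the transformers pipeline; the model ID and file name below are example choices:

```python
# A sketch using the transformers pipeline; superb/wav2vec2-base-superb-sid
# is a checkpoint fine-tuned for speaker identification on VoxCeleb1
from transformers import pipeline

classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-sid")
classifier("speech.wav")  # e.g. [{'label': 'id10003', 'score': 0.98}, ...]
```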
Solving audio classification for your own data
We have some great news! You can fine-tune (transfer learn from) a pretrained model to get a well-performing classifier without requiring much data. Pretrained models such as Wav2Vec2 and HuBERT exist. Facebook's Wav2Vec2 XLS-R is a large multilingual model trained on 128 languages and 436K hours of speech. Similarly, you can also use OpenAI's Whisper, trained on up to 4 million hours of multilingual speech data, for this task.
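A condensed fine-tuning sketch with transformers might look like the following; the dataset, checkpoint, and hyperparameters here are illustrative assumptions, not a tested recipe:

```python
# Illustrative fine-tuning sketch; the ("superb", "ks") dataset, the
# facebook/wav2vec2-base checkpoint, and all hyperparameters are assumptions
from datasets import load_dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("superb", "ks")  # keyword spotting (Speech Commands)
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def preprocess(batch):
    # Pad/truncate every clip to one second so default batching works
    audio = batch["audio"]
    inputs = feature_extractor(
        audio["array"],
        sampling_rate=audio["sampling_rate"],
        max_length=16000,
        truncation=True,
        padding="max_length",
    )
    batch["input_values"] = inputs.input_values[0]
    return batch

encoded = dataset.map(preprocess, remove_columns=["audio", "file"])

num_labels = encoded["train"].features["label"].num_classes
model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wav2vec2-keyword-spotting"),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()
```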
Useful Resources
Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!
Notebooks
Scripts for training
Documentation
Compatible libraries
- Note: An easy-to-use model for command recognition.
- Note: An emotion recognition model.
- Note: A language identification model.
- Note: A benchmark of 10 different audio tasks.
- Note: A dataset of YouTube clips and their sound categories.
- Note: An application that can classify music into different genres.
- accuracy: the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
- recall: the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with: Recall = TP / (TP + FN), where TP is the true positives and FN is the false negatives.
- precision: the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via: Precision = TP / (TP + FP), where TP is the true positives (examples correctly labeled as positive) and FP is the false positives (examples incorrectly labeled as positive).
- f1: the harmonic mean of the precision and recall. It can be computed with: F1 = 2 * (precision * recall) / (precision + recall).
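These metrics can be computed directly with the evaluate library; here is a quick sketch, where the reference and prediction labels are toy values for illustration:

```python
# A quick sketch computing the metrics above with the `evaluate` library;
# the reference/prediction labels are toy values for illustration
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

references = [0, 1, 1, 0]
predictions = [0, 1, 0, 0]
print(accuracy.compute(references=references, predictions=predictions))
# {'accuracy': 0.75}
print(f1.compute(references=references, predictions=predictions))
# {'f1': 0.6666666666666666}
```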