Model Details
This is a CRNN (convolutional recurrent neural network) sound event detection model, pre-trained on AudioSet and then fine-tuned on AudioSet-strong. It contains 8 convolutional layers and a GRU, has a time resolution of 40 ms, and has about 6.4 million parameters in total.
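To make the 40 ms resolution concrete, here is a minimal sketch that converts between output frame indices and timestamps. It assumes each output frame covers exactly 40 ms and that frame 0 starts at time 0. The helper names are hypothetical and not part of the model's API.

```python
FRAME_MS = 40  # time resolution stated above; assumed to be the hop between output frames


def frame_to_seconds(frame_index: int) -> float:
    """Start time (in seconds) of the given output frame (hypothetical helper)."""
    return frame_index * FRAME_MS / 1000


def seconds_to_frame(t: float) -> int:
    """Index of the output frame that contains time t (hypothetical helper)."""
    return int(round(t * 1000)) // FRAME_MS
```

Under these assumptions, a 10-second clip corresponds to roughly 250 output frames. The exact count returned by the model may differ slightly because of padding at the clip boundaries.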
Usage
import torch
from transformers import AutoModel
import torchaudio
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-audioset-sed",
    trust_remote_code=True
).to(device)
# Load each file, resample it to the model's sample rate, and downmix to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]
wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]
# Zero-pad the shorter waveform so both fit in one (batch, n_samples) tensor
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True).to(device)
with torch.no_grad():
    output = model(waveform=wav_batch)
# output: {
#     "framewise_output": (2, 447, n_frames),
#     "clipwise_output": (2, 447)
# }
# The class names are in `model.classes`.
# For example, the probability sequence of male speech is:
male_speech_prob = output["framewise_output"][:, model.classes.index("Male speech, man speaking"), :]
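A frame-level probability sequence like the one above is usually turned into event segments with onset and offset times. The following is a minimal sketch of one simple way to do that, not the author's post-processing: it applies a fixed threshold and merges consecutive active frames. The threshold of 0.5, the 40 ms frame duration, and the function name `probs_to_segments` are assumptions made for illustration.

```python
from typing import List, Tuple

FRAME_MS = 40  # assumed: one output frame per 40 ms, as stated in Model Details


def probs_to_segments(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Return (onset, offset) times in seconds for each run of frames
    whose probability is at or above the threshold (hypothetical helper)."""
    segments = []
    start = None  # index of the first frame of the current active run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a run begins
        elif p < threshold and start is not None:
            # a run ends at the start of the first inactive frame
            segments.append((start * FRAME_MS / 1000, i * FRAME_MS / 1000))
            start = None
    if start is not None:
        # the run extends to the end of the sequence
        segments.append((start * FRAME_MS / 1000, len(probs) * FRAME_MS / 1000))
    return segments
```

For the first clip in the batch, this could be called as `probs_to_segments(male_speech_prob[0].tolist())`. Because the batch is zero-padded, frames past the end of the shorter clip correspond to silence and should be ignored or trimmed.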