
Model Details

This is a text-to-audio grounding model. Given an audio clip and a text prompt describing a sound event, the model predicts the frame-level probability of the event's presence at a time resolution of 40 ms.
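
Concretely, 40 ms per frame means 25 output frames per second, so output frame i covers the interval [i/25, (i+1)/25) seconds. A minimal illustration (the helper is ours, not part of the model's API):

def frame_to_interval(i, frames_per_second=25):
    # Map output frame index i to its (start, end) time in seconds
    return i / frames_per_second, (i + 1) / frames_per_second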

Compared to the previous version, we made a few changes:

  • The model is trained on the larger AudioCaps v2 dataset
  • The text encoder is a frozen RoBERTa from LAION-CLAP, so the parameter count is larger and most parameters belong to the text encoder (see the parameter-count sketch after this list)
  • Minor training configurations are optimized
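
To check where the parameters sit, here is a minimal sketch that counts total and frozen parameters after loading the model (it assumes the released checkpoint keeps requires_grad=False on the frozen text encoder):

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-laionclap-audiocapsv2-grounding",
    trust_remote_code=True
)
# Count all parameters vs. those with gradients disabled (the frozen text encoder)
total = sum(p.numel() for p in model.parameters())
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"total: {total / 1e6:.1f}M, frozen: {frozen / 1e6:.1f}M")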

Usage

The usage is the same as for the previous version:

import torch
import torchaudio
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-laionclap-audiocapsv2-grounding",
    trust_remote_code=True
).to(device)

# Load each clip, resample to the model's sample rate, and downmix to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# Zero-pad to a batch; the original lengths are passed separately below
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True).to(device)

# One text query per audio clip in the batch
text = ["a man speaks", "a dog is barking"]

with torch.no_grad():
    output = model(
        audio=wav_batch,
        audio_len=[wav1.size(0), wav2.size(0)],
        text=text
    )
    # output shape: (2, n_frames), where n_frames = n_seconds * 25 (one frame every 40 ms)
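
Since the output is a frame-level probability curve, turning it into time-stamped segments is a simple thresholding pass over each row. A minimal sketch (the 0.5 threshold and the merging logic are our choices, not part of the model; it assumes output is a (batch, n_frames) probability tensor):

def probs_to_segments(probs, threshold=0.5, frames_per_second=25):
    # Convert a 1-D tensor of frame probabilities into (onset, offset) pairs in seconds
    active = (probs > threshold).tolist()
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append((start / frames_per_second, i / frames_per_second))
            start = None
    if start is not None:
        segments.append((start / frames_per_second, len(active) / frames_per_second))
    return segments

print(probs_to_segments(output[0].cpu()))  # segments for "a man speaks" in file1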

Citation

@article{xu2024towards,
    title={Towards Weakly Supervised Text-to-Audio Grounding},
    author={Xu, Xuenan and Ma, Ziyang and Wu, Mengyue and Yu, Kai},
    journal={arXiv preprint arXiv:2401.02584},
    year={2024}
}