balacoon/vq4_50fps_24khz_vocoder

Balacoon Discrete Vocoder

This discrete vocoder consists of both analysis and synthesis components.

Analysis: Converts audio into audio tokens drawn from four parallel codebooks, each containing 2,048 entries.
Synthesis: Converts audio tokens back into audio.

The vocoder operates on 24 kHz audio at a rate of 50 frames per second. It is designed as a middle ground between the high bitrate of EnCodec and lower-bitrate alternatives such as Mimi (12.5 frames per second) or WavTokenizer (which uses a single codebook).
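The figures above imply a nominal bitrate, which can be worked out as a quick sanity check. The sketch below assumes each token is stored as a plain fixed-width index (no entropy coding); the constant names are illustrative and not part of the model's API.

```python
import math

# Values taken from the description above.
NUM_CODEBOOKS = 4      # parallel codebooks per frame
CODEBOOK_SIZE = 2048   # entries per codebook
FRAME_RATE_HZ = 50     # frames per second

# Each token indexes one of 2,048 entries, so it needs log2(2048) bits.
bits_per_token = math.log2(CODEBOOK_SIZE)
# Every frame carries one token per codebook.
bits_per_frame = NUM_CODEBOOKS * bits_per_token
# Nominal bitrate, assuming fixed-width indices and no entropy coding.
bitrate_bps = bits_per_frame * FRAME_RATE_HZ

print(f"{bits_per_token:.0f} bits/token, {bits_per_frame:.0f} bits/frame, "
      f"{bitrate_bps / 1000:.1f} kbps")
```

Under these assumptions the token stream comes to 2.2 kbps.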

How to Use the Vocoder:

import torch
import soundfile as sf
from huggingface_hub import hf_hub_download

device = torch.device('cuda')

# load the model
encoder_path = hf_hub_download(repo_id="balacoon/vq4_50fps_24khz_vocoder", filename="analysis.jit")
decoder_path = hf_hub_download(repo_id="balacoon/vq4_50fps_24khz_vocoder", filename="synthesis.jit")
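# map_location places the TorchScript modules on the same device as the input audio
encoder = torch.jit.load(encoder_path, map_location=device)
decoder = torch.jit.load(decoder_path, map_location=device)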

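# read the audio (must be sampled at 24 kHz)
path = "input.wav"  # placeholder: replace with the path to your own audio file
orig_audio_npy, sr = sf.read(path, dtype="int16")
assert sr == 24000, f"expected 24 kHz audio, got {sr} Hz"
orig_audio = torch.tensor(orig_audio_npy).to(device).unsqueeze(0)  # batch x samples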
# extract audio tokens from the audio
tokens = encoder(orig_audio)  # batch x frames x 4
# synthesize audio from audio tokens
resynthesized_audio = decoder(tokens)  # batch x samples
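
To check the shape of `tokens` in the snippet above, it can help to estimate the expected number of frames from the clip length. At 24 kHz and 50 frames per second, each frame spans 480 samples. The helper below is a hypothetical sketch, not part of the model: it assumes the encoder emits roughly one frame per 480 input samples, and the exact count may differ by a frame or so depending on how the model pads its input.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_HZ = 50
NUM_CODEBOOKS = 4
# Number of audio samples covered by one token frame.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def approx_token_shape(num_samples, batch_size=1):
    """Estimate the (batch, frames, codebooks) shape for a clip.

    Assumption: one frame per SAMPLES_PER_FRAME samples, rounding down.
    The real encoder may pad and produce a slightly different frame count.
    """
    frames = num_samples // SAMPLES_PER_FRAME
    return (batch_size, frames, NUM_CODEBOOKS)


# A 3-second clip at 24 kHz has 72,000 samples.
print(approx_token_shape(3 * SAMPLE_RATE_HZ))
```

Comparing this estimate against `tokens.shape` is a cheap way to catch inputs at the wrong sample rate or with an unexpected channel layout.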

See how the codec performs on the vocoder leaderboard: TTSLeaderboard