Model Card for s5-hubert-decoder

Model Details

Model Description

  • Model type: Flow-matching-based Diffusion Transformer with a BigVGAN vocoder
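For background (general knowledge, not taken from this card): flow matching trains a vector field v_θ to transport a noise sample x₀ toward a data sample x₁ along a simple probability path. With the common linear interpolation path, the training objective is:

```latex
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad
\mathcal{L}_{\mathrm{FM}}
= \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```

In this model, the Diffusion Transformer presumably predicts such a vector field over acoustic features conditioned on the discrete units, and BigVGAN converts the resulting features to a waveform; the exact feature representation is not stated in this card.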

How to Get Started with the Model

Use the code below to get started with the model.

git clone https://github.com/ryota-komatsu/speaker_disentangled_hubert.git
cd speaker_disentangled_hubert

sudo apt install git-lfs  # for UTMOS

# create and activate a Python environment with the pinned dependencies
conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.18 pip=24.0 faiss-gpu=1.11.0
conda activate py310
pip install -r requirements/requirements.txt

# run the repository setup script
sh scripts/setup.sh

Then run the following Python code:
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert import S5HubertForSyllableDiscovery

wav_path = "/path/to/wav"

# download the pretrained models from the Hugging Face Hub
encoder = S5HubertForSyllableDiscovery.from_pretrained("ryota-komatsu/s5-hubert", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/s5-hubert-decoder", device_map="cuda")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into syllabic units
outputs = encoder(waveform.to(encoder.device))

# syllabic units
units = outputs[0]["units"]  # [3950, 67, ..., 503]
units = units.unsqueeze(0)  # add a batch dimension

# unit-to-speech synthesis
audio_values = decoder(units)
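To persist the synthesized audio, any WAV writer will do. Below is a minimal, self-contained sketch that writes float samples as 16-bit PCM using only the standard library. The sine tone is a placeholder standing in for the real decoder output, and the 16 kHz rate is an assumption — use whatever sample rate the BigVGAN vocoder actually emits:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # assumption; substitute the vocoder's actual output rate

# Placeholder samples standing in for decoder output (one second of a 440 Hz tone).
samples = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]

with wave.open("synthesized.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit PCM
    f.setframerate(SAMPLE_RATE)
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    f.writeframes(struct.pack("<%dh" % len(samples), *(int(s * 32767) for s in clamped)))
```

In the real pipeline, something like `audio_values.squeeze().cpu().tolist()` would produce the sample list — the exact shape of `audio_values` is an assumption here.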

Training Hyperparameters

  • Training regime: fp16 mixed precision
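The fp16 mixed-precision regime above can be sketched with PyTorch's automatic mixed precision. This is an illustrative sketch only, not the repository's actual training loop: the model, optimizer, and data are placeholders, and on a CPU-only machine it falls back to bfloat16, which CPU autocast supports:

```python
import torch

# Placeholder model and optimizer (not the repository's actual training code).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.bfloat16  # CPU autocast prefers bf16

model = torch.nn.Linear(16, 16).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling only matters for fp16

x = torch.randn(8, 16, device=device)
with torch.autocast(device_type=device, dtype=dtype):
    loss = model(x).pow(2).mean()  # forward pass runs in reduced precision

scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscale gradients, then step
scaler.update()
```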

Hardware

1 x NVIDIA RTX A6000

Model Card Authors

Ryota Komatsu

Model size: 34.5M parameters (F32, Safetensors)
