Model Card for ryota-komatsu/flow_matching_with_bigvgan

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. It resynthesizes speech waveforms from discrete speech units (pseudo-phonetic units) using a conditional flow-matching model combined with a BigVGAN vocoder.

  • License: MIT
  • Model size: 23.5M parameters (F32)

Model Sources

  • Repository: https://github.com/ryota-komatsu/speech_resynth

How to Get Started with the Model

Use the code below to get started with the model.

git clone https://github.com/ryota-komatsu/speech_resynth.git
cd speech_resynth

sudo apt install git-lfs  # for UTMOS

conda create -y -n py39 python=3.9.21 pip=24.0
conda activate py39
pip install -r requirements/requirements.txt

sh scripts/setup.sh  # download textlesslib and UTMOS

cd src/textlesslib
pip install -e .
cd -

import torchaudio
from textless.data.speech_encoder import SpeechEncoder

from src.flow_matching.models import ConditionalFlowMatchingWithBigVGan

wav_path = "/path/to/wav"

# load a speech encoder that quantizes mHuBERT features into discrete units
encoder = SpeechEncoder.by_name(
    dense_model_name="mhubert-base-vp_mls_cv_8lang",
    quantizer_model_name="kmeans-expresso",
    vocab_size=2000,
    deduplicate=False,
    need_f0=False,
).cuda()

# download the pretrained decoder from the Hugging Face Hub
decoder = ConditionalFlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/flow_matching_with_bigvgan").cuda()

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode a waveform into pseudo-phonetic units
units = encoder(waveform.cuda())["units"]
units = units.unsqueeze(0) + 1  # shift IDs by 1 so that 0 is reserved for padding

# resynthesis
audio_values = decoder(units)
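
The decoder returns the resynthesized audio as a tensor. The snippet below is a minimal sketch for saving it to disk; it assumes audio_values is a (batch, samples) or (batch, 1, samples) float tensor and that the output sampling rate is 16 kHz (matching the training data), so check the repository if your setup differs.

# move the first generated waveform to the CPU (shape and sampling rate are assumptions, see above)
resynth = audio_values[0].detach().cpu()
if resynth.dim() == 1:
    resynth = resynth.unsqueeze(0)  # torchaudio.save expects (channels, samples)
torchaudio.save("resynth.wav", resynth, 16000)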

Training Data

The LibriTTS-R train set, downsampled to 16 kHz.
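
As a rough illustration of that preprocessing step, the sketch below resamples LibriTTS-R wav files (released at 24 kHz) to 16 kHz with torchaudio. The directory paths are hypothetical placeholders; the repository's own data-preparation scripts are the authoritative reference.

import pathlib

import torchaudio

src_dir = pathlib.Path("data/LibriTTS_R/train-clean-100")      # hypothetical source directory
dst_dir = pathlib.Path("data/LibriTTS_R_16k/train-clean-100")  # hypothetical output directory

for src in src_dir.rglob("*.wav"):
    waveform, sr = torchaudio.load(str(src))
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    dst = dst_dir / src.relative_to(src_dir)
    dst.parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(str(dst), waveform, 16000)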
