Rimecaster (en-US)


Rimecaster was developed by Rime Labs. It was trained with TTS tasks in mind and is useful for speaker conditioning. The model extracts speaker embeddings from input speech, which can serve as the backbone for various TTS models. It is adapted from TitaNet-Large, with the embedding dimension increased from 192 to 768.

Read more in the launch announcement blog post.

See the model architecture section and NeMo documentation for complete architecture details.

NVIDIA NeMo: Training

To train, fine-tune, or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

pip install "nemo_toolkit[all]"

How to Use this Model

The model is available for use in the NeMo toolkit [2] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")

Embedding Extraction

To extract an embedding from a single audio file, call get_embedding:

emb = speaker_model.get_embedding("an255-fash-b.wav")
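A common sanity check on extracted embeddings is to compare two of them with cosine similarity. The sketch below is illustrative only and uses plain Python lists so that it runs without NeMo installed. It assumes each embedding has already been flattened to a list of floats (for example with something like `emb.squeeze().tolist()`, which assumes `get_embedding` returns a PyTorch tensor); the function name `cosine_similarity` is hypothetical and not part of NeMo.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors.

    `a` and `b` are assumed to be flat sequences of floats, e.g. two
    flattened 768-dimensional Rimecaster embeddings.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Values close to 1 suggest the two clips are similar in embedding space. Any decision threshold would have to be tuned on your own data.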

Extracting Embeddings for Multiple Audio Files

To extract embeddings from multiple audio files, first list the files in a manifest.json file, with one JSON object per line in the following format:

{"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": "duration of file in sec", "label": "speaker_id"}
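A manifest in this format can be generated with a few lines of Python. The following is a minimal sketch using only the standard library; the helper names `wav_duration_seconds` and `write_manifest` are hypothetical, the inputs are assumed to be PCM WAV files, and the duration is written as a number of seconds, which is an assumption about what the extraction script expects.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(entries, manifest_path="manifest.json"):
    """Write one JSON object per line for each (wav_path, speaker_id) pair."""
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, speaker_id in entries:
            record = {
                # The manifest format asks for absolute paths.
                "audio_filepath": str(Path(wav_path).resolve()),
                "duration": wav_duration_seconds(wav_path),
                "label": speaker_id,
            }
            out.write(json.dumps(record) + "\n")
```

For example, `write_manifest([("clips/a.wav", "speaker_1"), ("clips/b.wav", "speaker_2")])` would produce a two-line manifest.json in the current directory (the paths and speaker IDs here are placeholders).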

Then run the following script to extract the embeddings and write them to the current working directory:

python <NeMo_root>/examples/speaker_tasks/recognition/extract_speaker_embeddings.py --manifest=manifest.json --model_path='/path/to/.nemo/file'

Input

This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
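Before extracting embeddings, it can help to confirm that a file matches this input format. The check below is a small sketch using Python's standard `wave` module, which only reads PCM WAV files; the function name `is_16khz_mono` is hypothetical.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated for this model's input


def is_16khz_mono(path):
    """Return True if the WAV file is sampled at 16 kHz with one channel."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_SAMPLE_RATE
            and wav.getnchannels() == 1
        )
```

Files that fail this check would need to be resampled or downmixed with an audio tool of your choice before use.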

Output

This model provides speaker embeddings for an audio file.

Model Architecture

TitaNet is a depth-wise separable Conv1D model [1] for speaker verification and diarization tasks. More details on this model are available here: TitaNet-Model.

Training

The NeMo toolkit [2] was used to train the models for several hundred epochs. These models are trained with this example script and this base config.

References

[1] TitaNet: Neural Model for Speaker Representation with 1D Depth-wise Separable convolutions and global context

[2] NVIDIA NeMo Toolkit

License

Use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.
