Rimecaster (en-US)
Rimecaster was trained with TTS tasks in mind and is useful for speaker conditioning.
This model extracts speaker embeddings from given speech, which can be the backbone for various TTS models.
This model is adapted from TitaNet-Large, with the embedding dimension increased from 192 to 768.
See the model architecture section and NeMo documentation for complete architecture details.
NVIDIA NeMo: Training
To train, fine-tune, or experiment with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.
pip install nemo_toolkit['all']
How to Use this Model
The model is available for use in the NeMo toolkit [2] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
Automatically instantiate the model
import nemo.collections.asr as nemo_asr
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")
Embedding Extraction
Use get_embedding to extract an embedding from a single audio file:
emb = speaker_model.get_embedding("an255-fash-b.wav")
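Speaker embeddings are commonly compared with cosine similarity. The following is a minimal sketch of that comparison in plain Python; the helper name cosine_similarity is hypothetical and is not part of NeMo. It assumes each embedding has already been converted to a flat list of floats.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

If get_embedding returns a PyTorch tensor with a leading batch dimension (an assumption worth checking against your NeMo version), something like emb.squeeze().tolist() would produce the flat list this helper expects. Values near 1 suggest the two recordings come from similar voices; any decision threshold has to be chosen on your own data.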
Extracting Embeddings for Multiple Audio Files
To extract embeddings from multiple audio files:
Write the audio files to a manifest.json file, one JSON object per line, in the following format:
{"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": "duration of file in sec", "label": "speaker_id"}
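A manifest in this format can be generated with a short script. The sketch below uses only the Python standard library; the helper names wav_duration_seconds and write_manifest are hypothetical, and it assumes the inputs are uncompressed PCM WAV files that the wave module can read. It writes the duration as a number of seconds, which is an assumption about what the extraction script expects.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def write_manifest(items, manifest_path):
    """Write one JSON object per line for each (wav_path, speaker_id) pair."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for wav_path, speaker_id in items:
            entry = {
                # Absolute path, as the manifest format asks for.
                "audio_filepath": str(Path(wav_path).resolve()),
                "duration": round(wav_duration_seconds(wav_path), 3),
                "label": speaker_id,
            }
            f.write(json.dumps(entry) + "\n")
```

For example, write_manifest([("a.wav", "speaker_0"), ("b.wav", "speaker_1")], "manifest.json") would produce a two-line manifest, where the file names and speaker IDs are placeholders.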
Then run the following script, which extracts the embeddings and writes them to the current working directory:
python <NeMo_root>/examples/speaker_tasks/recognition/extract_speaker_embeddings.py --manifest=manifest.json --model_path='/path/to/.nemo/file'
Input
This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.
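Before extracting embeddings, it can help to confirm that a file matches this input format. The following is a minimal sketch using the standard-library wave module; the function name check_wav_format is hypothetical, and the check only works for uncompressed PCM WAV files.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000  # Hz, i.e. 16 kHz
EXPECTED_CHANNELS = 1         # mono


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file looks fine."""
    problems = []
    with wave.open(str(path), "rb") as wf:
        if wf.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(
                f"sample rate is {wf.getframerate()} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
            )
        if wf.getnchannels() != EXPECTED_CHANNELS:
            problems.append(
                f"file has {wf.getnchannels()} channels, expected {EXPECTED_CHANNELS}"
            )
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before use.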
Output
This model provides speaker embeddings for an audio file.
Model Architecture
TitaNet is a depth-wise separable Conv1D model [1] for speaker verification and diarization tasks. More details on this model are available here: TitaNet-Model.
Training
The NeMo toolkit [2] was used to train the models for several hundred epochs. These models were trained with this example script and this base config.
References
[1] TitaNet: Neural Model for Speaker Representation with 1D Depth-wise Separable Convolutions and Global Context
[2] NVIDIA NeMo Toolkit
License
Use of this model is covered by the CC-BY-4.0 license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.