Seeking a Clear Guide for Fine-Tuning NVIDIA NeMo Models on New English Audio Domains
I'm struggling to find a clear, beginner-friendly example of fine-tuning an NVIDIA NeMo model, such as parakeet-tdt-0.6b-v2 or canary-1b-flash, on a new English audio domain, like a medical or legal ASR dataset. Most resources focus on transfer learning to new spoken languages, but I'm looking for guidance on continued pre-training with English data from a specific domain.
Ideally, I need an up-to-date tutorial for fine-tuning these pre-trained models that is simple, robust, and extensible (e.g., easy to add data augmentation). I've found little documentation and few forum posts or YouTube tutorials that address this specific use case.
Could you share links to relevant docs, notebooks, blogs, tutorials, or repositories that demonstrate this process clearly?
Thanks!
The standard NeMo fine-tuning docs cover this workflow. The language-transfer tutorials follow the same recipe; for an English-to-English domain shift you simply keep the existing tokenizer instead of replacing it. A minimal code sketch follows the links below.

Recent tutorial (Georgian, but the steps carry over): https://developer.nvidia.com/blog/developing-robust-georgian-automatic-speech-recognition-with-fastconformer-hybrid-transducer-ctc-bpe/
Fine-tuning configurations in the NeMo user guide: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/configs.html#fine-tuning-configurations
End-to-end fine-tuning walkthrough (Kinyarwanda): https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/examples/kinyarwanda_asr.html
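Here is a minimal fine-tuning sketch against NeMo's Python API, assuming a recent NeMo install with PyTorch Lightning as the trainer backend (newer releases import it as lightning.pytorch). The manifest paths, batch size, learning rate, and output filename are placeholders; the sketch targets a transducer-style model like parakeet-tdt, and canary's multitask data config needs additional fields.

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf

import nemo.collections.asr as nemo_asr

# Load a pretrained checkpoint by name (downloads from NGC / Hugging Face).
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# NeMo expects JSON-lines manifests, one object per line:
#   {"audio_filepath": "/data/clip.wav", "duration": 4.2, "text": "..."}
# Point the model's existing data loaders at your domain manifests.
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": True,
})
val_cfg = OmegaConf.create({
    "manifest_filepath": "val_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": False,
})
asr_model.setup_training_data(train_data_config=train_cfg)
asr_model.setup_validation_data(val_data_config=val_cfg)

# Use a much smaller peak LR than from-scratch training to avoid
# catastrophically forgetting the general-English acoustics.
asr_model.setup_optimization(
    optim_config=OmegaConf.create({"name": "adamw", "lr": 1e-5})
)

trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=1)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)

asr_model.save_to("parakeet_domain_ft.nemo")  # placeholder output name
```

Note there is no tokenizer/vocabulary change step here: that is only required when moving to a new language, which is why the Georgian and Kinyarwanda tutorials include one. NeMo's repo also has a config-driven fine-tuning script under examples/asr (speech_to_text_finetune.py) that wraps these same steps; the fine-tuning configurations page above describes its config layout.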
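For the data-augmentation extension: NeMo's audio data loaders accept an augmentor section in the training-data config that applies waveform perturbations on the fly (SpecAugment is typically already enabled in these pretrained models' spec_augment config). A sketch, with illustrative, untuned probabilities and ranges:

```python
from omegaconf import OmegaConf

# The same training-data config as above, extended with on-the-fly
# waveform perturbations; probabilities and ranges here are illustrative.
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": True,
    "augmentor": {
        # Mix in synthetic white noise at a random level (dB).
        "white_noise": {"prob": 0.5, "min_level": -90, "max_level": -46},
        # Randomly scale signal gain (dBFS).
        "gain": {"prob": 0.5, "min_gain_dbfs": -10.0, "max_gain_dbfs": 10.0},
        # Slight speed perturbation via resampling.
        "speed": {
            "prob": 0.5,
            "sr": 16000,
            "resample_type": "kaiser_fast",
            "min_speed_rate": 0.95,
            "max_speed_rate": 1.05,
        },
    },
})
asr_model.setup_training_data(train_data_config=train_cfg)
```

Speed perturbation effectively resamples each utterance, so it is the most expensive of the three; NeMo also offers a noise perturbation (mixing real noise files listed in a separate manifest) if you have domain-matched background recordings.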