Poor WER when fine-tuning Parakeet v2 TDT on a non-English dataset
Hi Everyone,
I am trying to fine-tune Parakeet v2 TDT on the GramVani dataset (link). Here is the configuration I am using:
https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/hindi_config.yaml
And the training script is available here: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/finetune.py
The script is the usual fine-tuning script.
The full code is available here.
Here are some of the logs from training. I trained for only 3 epochs.
| Name | Type | Params | Mode
--------------------------------------------------------------------------------
0 | preprocessor | AudioToMelSpectrogramPreprocessor | 0 | train
1 | encoder | ConformerEncoder | 608 M | eval
2 | spec_augmentation | SpectrogramAugmentation | 0 | train
3 | wer | WER | 0 | train
4 | joint | RNNTJoint | 1.7 M | train
5 | decoder | RNNTDecoder | 7.2 M | train
6 | loss | RNNTLoss | 0 | train
7 | spec_augment | SpectrogramAugmentation | 0 | train
--------------------------------------------------------------------------------
9.0 M Trainable params
608 M Non-trainable params
617 M Total params
2,471.304 Total estimated model params size (MB)
46 Modules in train mode
662 Modules in eval mode
Epoch 0: 0%| | 0/9200 [00:00<?, ?it/s][NeMo I 2025-06-03 13:03:28 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.models.rnnt_bpe_models.EncDecRNNTBPEModel'>.decoding.decoding
[NeMo I 2025-06-03 13:03:28 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>wer.decoding.decoding
[NeMo W 2025-06-03 13:03:30 nemo_logging:405] Provided RNNT Joint tensor is of dtype torch.float16, but RNNT loss could not be calculated in fp16 due to following reason stated below. Loss will be calculated in fp32.
[NeMo I 2025-06-03 13:15:33 nemo_logging:393] reference:के चक्कर में सूती सारी ले होती है इसलिए कोयला की सुविधा हम झारखण्ड सरकार ऐसी कहेंगे की कोयला की सुविधा बढ़ाने लिए
[NeMo I 2025-06-03 13:15:33 nemo_logging:393] predicted:बारिश पू ग्राम के ब्यंग हुई ने खास आदाब आपकेेशनहचनालहह्ग मेंबी ख़ ख़ालतहगन निकाल निकालग निकाल निकाल्सलस है का का का का का कासcompधसletगगग माम सं कि ऐसी M का ख़ंह सकती होते्य सं हैं है के है का है
Epoch 0: 43%|▍| 3999/9200 [24:06<31:21, 2.76it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:27:35 nemo_logging:393]
[NeMo I 2025-06-03 13:27:35 nemo_logging:393] reference:गहरे पानी के अलावा ब्लीचिंग पाउडर का छिडकाव करना सफाई करना गहरे पानी पे कोई नहीं जाए इसलिए नागरिकों की रक्षा करना
[NeMo I 2025-06-03 13:27:35 nemo_logging:393] predicted:है वाणीते है को है की के का में है की में दो के को है को है की को वाणी के है न हहम है को है
Epoch 0: 65%|▋| 5999/9200 [36:05<19:15, 2.77it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:39:34 nemo_logging:393]
[NeMo I 2025-06-03 13:39:34 nemo_logging:393] reference:तो चलिए सुनते है नया कार्यक्रम
[NeMo I 2025-06-03 13:39:34 nemo_logging:393] predicted:नमस्कार के लिए के लिए रही केेे के में कोजस के लिए को को की को की को को की को और को
Epoch 0: 87%|▊| 7999/9200 [48:07<07:13, 2.77it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:51:36 nemo_logging:393]
[NeMo I 2025-06-03 13:51:36 nemo_logging:393] reference:ज़बरन शादी करा दी जा रही है बच्चों के अधिसूचित अधिकारों पे काम करने वाली अंतराष्ट्रीय
[NeMo I 2025-06-03 13:51:36 nemo_logging:393] predicted:नमस्कार मैं को और को की को की को है को को को को को और के लिए
Epoch 0: 100%|█| 9200/9200 [55:21<00:00, 2.77it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Enabled CUDA graphs for module <class 'nemo.collections.asr.models.rnnt_bpe_models.EncDecRNNTBPEModel'>.decoding.decoding
[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Enabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>wer.decoding.decoding
Epoch 1: 0%| | 0/9200 [00:00<?, ?it/s, v_num=2-55, train_step_timing in s=0.43[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.models.rnnt_bpe_models.EncDecRNNTBPEModel'>.decoding.decoding
[NeMo I 2025-06-03 13:58:50 nemo_logging:393] Disabled CUDA graphs for module <class 'nemo.collections.asr.metrics.wer.WER'>wer.decoding.decoding
Epoch 1: 9%| | 799/9200 [04:51<51:07, 2.74it/s, v_num=2-55, train_step_timing[NeMo I 2025-06-03 14:03:42 nemo_logging:393]
[NeMo I 2025-06-03 14:03:42 nemo_logging:393] reference:आप व अपनी राय या प्रतिक्रिया दे सकते हैं नों तीन दबा का हमें आपकी प्रतिक्रिया का इंतेज़ार रहेगा
[NeMo I 2025-06-03 14:03:42 nemo_logging:393] predicted:नमस्कार आदाब को और को और को और को को को
Epoch 1: 30%|▎| 2799/9200 [16:57<38:47, 2.75it/s, v_num=2-55, train_step_timin[NeMo I 2025-06-03 14:15:47 nemo_logging:393]
Here is my training loss plot.
Here is the WER plot for the training batches only.
As we can see, there are three problems:
- WER is very poor.
- One epoch takes 56 minutes on a P100 GPU with 16 GB VRAM, even though only 9 million parameters are being trained with the encoder frozen.
- Memory usage is around 11 GB with a batch size of just 4.
Problem 1: High WER during training itself
We haven't even evaluated the validation WER yet, and the WER is already high on the training set.
I suspected that the BPE encoding scheme might not be applied correctly, so I tested one sentence.
import sentencepiece as spm

# Load the trained SentencePiece model
model_prefix = 'tokenizer_output/tokenizer'
sp = spm.SentencePieceProcessor()
sp.load(f'{model_prefix}.model')

# Encode a sample sentence into subword pieces
test_text = "नमस्कार मैं दीपक कुमार सिंह"
encoded = sp.encode_as_pieces(test_text)
print("\nTest encoding:")
print(f"Original: {test_text}...")
print(f"Encoded: {encoded}...")
I got:
Test encoding:
Original: नमस्कार मैं दीपक कुमार सिंह...
Encoded: ['▁नमस्कार', '▁मैं', '▁दी', 'प', 'क', '▁कुमार', '▁सिंह']...
So I think it is working.
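Continuing from the snippet above, here is a quick round-trip check (pure SentencePiece, nothing NeMo-specific) I would also run to rule out a lossy tokenizer. This is just a sketch reusing the sp, test_text and encoded variables from the test above:
ids = sp.encode_as_ids(test_text)          # token IDs the model would actually see
pieces = [sp.id_to_piece(i) for i in ids]  # map the IDs back to pieces
decoded = sp.decode_pieces(encoded)        # pieces back to plain text
print(f"IDs: {ids}")
print(f"Pieces from IDs: {pieces}")
print(f"Lossless round trip: {decoded == test_text}")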
The BPE tokenizer training code I am using is this:
import sentencepiece as spm
import json
import os

# Create output directory
os.makedirs('tokenizer_output', exist_ok=True)

# Extract texts from manifest
texts = []
with open('train_manifest.json', 'r', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line.strip())
        if 'text' in data and data['text'].strip():
            texts.append(data['text'])
print(f"Found {len(texts)} texts for training")

# Save texts to document.txt (raw corpus)
document_file = 'tokenizer_output/document.txt'
with open(document_file, 'w', encoding='utf-8') as f:
    for text in texts:
        f.write(text + '\n')
print(f"Saved raw text corpus to {document_file}")

# Train SentencePiece model
model_prefix = 'tokenizer_output/tokenizer'
spm.SentencePieceTrainer.train(
    input=document_file,  # now using document.txt directly
    model_prefix=model_prefix,
    vocab_size=1024,
    model_type='bpe',
    character_coverage=0.9995,
    normalization_rule_name='identity',
    remove_extra_whitespaces=False,
    max_sentence_length=4192,
    shuffle_input_sentence=True
)
print(f"Tokenizer saved as {model_prefix}.model and {model_prefix}.vocab")

# Create human-readable vocab.txt
vocab_file = 'tokenizer_output/vocab.txt'
sp = spm.SentencePieceProcessor()
sp.load(f'{model_prefix}.model')
with open(vocab_file, 'w', encoding='utf-8') as f:
    for i in range(sp.get_piece_size()):
        piece = sp.id_to_piece(i)
        f.write(f"{piece}\n")
print(f"Saved human-readable vocabulary to {vocab_file}")
It uses the entire training corpus, so out-of-vocabulary words should not be an issue.
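In case the problem is in how the tokenizer is attached to the model rather than in the tokenizer itself, here is a minimal sketch of how I understand the swap is supposed to work in NeMo. The model name and the change_vocabulary call are my assumptions about the standard EncDecRNNTBPEModel API, not lines copied from my finetune.py, so please correct me if this step itself is wrong:
# Sketch: attach the new Hindi tokenizer to the pretrained checkpoint before fine-tuning.
# Assumes tokenizer_output/ contains tokenizer.model and vocab.txt as produced above.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
model.change_vocabulary(new_tokenizer_dir="tokenizer_output", new_tokenizer_type="bpe")

# Sanity check: the model-side tokenizer should now produce the same pieces as SentencePiece did.
print(model.tokenizer.text_to_tokens("नमस्कार मैं दीपक कुमार सिंह"))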
- The next thing that comes to mind is unfreezing the encoder; maybe that could improve WER (see the sketch after this list).
- Then simply increase the number of epochs, say to at least 100.
- Or increase the batch size from 4 to 16 as in the original config (if memory allows).
- Change the augmentation parameters during training.
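For the first idea, rather than unfreezing all 608 M encoder parameters at once, I am considering unfreezing only the top few Conformer blocks. A rough sketch of what I have in mind, assuming model.encoder.layers is the ModuleList of Conformer blocks (my understanding of NeMo's ConformerEncoder) and that this runs before the optimizer is set up:
# Keep the bulk of the encoder frozen, but let the last few blocks adapt to Hindi.
model.encoder.freeze()                          # freeze everything in the encoder
for layer in list(model.encoder.layers)[-4:]:   # "last 4 blocks" is an arbitrary choice
    for p in layer.parameters():
        p.requires_grad = True                  # make these blocks trainable again
    layer.train()                               # and put them back in train mode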
But the more important question is whether the model will work for languages other than English where only about 100 hours of data is available.
Problem 2: Slow training speed
One epoch takes around 56 minutes on a P100 GPU with 16 GB VRAM. Considering that only 9 million parameters are trainable, this seems slow to me.
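To figure out where the time actually goes (data loading vs. forward/backward), I plan to enable Lightning's built-in profiler. A minimal sketch, assuming the Trainer is constructed directly in finetune.py; if it is built from the YAML's trainer section instead, adding profiler: simple there should be equivalent:
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=1,
    accelerator="gpu",
    max_epochs=3,
    profiler="simple",   # prints a per-hook timing breakdown when training ends
)
If most of the time shows up in the dataloader hooks, raising num_workers in the train_ds section of the config would be my first fix.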
Problem 3: High memory usage
By my estimate, a 16 kHz sampling rate, an average duration of 15 seconds, a batch size of 16, and 4 bytes per amplitude value give only about 15 MB of raw audio per batch (16,000 × 15 × 16 × 4 ≈ 15 MB). Yet memory occupancy goes beyond 16 GB, which forced me to use a batch size of 4. Any clue why this happens? Also, is there any tool that gives me memory profiles of the GPU along with the training logs?
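What I can do on my side is log CUDA memory alongside the training logs. Below is a sketch with plain PyTorch counters; GPUMemoryLogger is my own hypothetical helper, not a NeMo or Lightning class, and Lightning's DeviceStatsMonitor callback is presumably the more standard option:
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import DeviceStatsMonitor

class GPUMemoryLogger(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Print peak/reserved CUDA memory every 200 batches, then reset the peak counter.
        if batch_idx % 200 == 0 and torch.cuda.is_available():
            alloc = torch.cuda.max_memory_allocated() / 2**30
            reserved = torch.cuda.memory_reserved() / 2**30
            print(f"batch {batch_idx}: peak allocated {alloc:.2f} GiB, reserved {reserved:.2f} GiB")
            torch.cuda.reset_peak_memory_stats()

# Passed wherever the Trainer is built; DeviceStatsMonitor logs accelerator stats to the experiment logger.
trainer = pl.Trainer(callbacks=[GPUMemoryLogger(), DeviceStatsMonitor()])
My own guess is that the raw waveform is negligible and the usage is dominated by encoder activations plus the RNNT joint output (roughly batch × time × target length × vocab size), but I would like to confirm that with real numbers.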