How can I align timesteps to text for Parakeet-tdt-0.6b-v2 output using KenLM?
Thanks to the NeMo team for the SOTA model! When I tried to run inference on the Parakeet ASR model with KenLM, I got an output like this:
```
Hypothesis(score=-4.529468536376953, y_sequence=tensor([1024, 223, 224, 5, 709, 50, 9, 172, 309, 64, 5, 168,
        167, 840, 822, 239, 839, 5, 147, 840, 821, 59, 819, 862,
        882, 15, 131, 55, 229, 131, 55, 39, 148, 4, 826, 30,
        104, 326, 841]), text='because by the final square on the chessboard, the debt is 18 billion trillion grains of rice.', dec_out=None, dec_state=(tensor([[[ 4.2064e-02, 1.4051e-02, -5.5140e-01, ..., -2.0446e-01,
        -1.0781e-04, -5.2153e-05]],
        [[-1.3940e-02, 4.0917e-02, 5.9655e-02, ..., 8.7109e-02,
        -4.7848e-03, 5.0823e-02]]]), tensor([[[ 0.9999, 0.0141, -0.6204, ..., -0.3344, -0.0409, -0.0473]],
        [[-0.0511, 0.1249, 0.2977, ..., 1.0271, -2.0819, 0.1519]]])), timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156], alignments=None, frame_confidence=None, token_confidence=None, word_confidence=None, length=tensor(164, device='cuda:0'), y=None, lm_state=None, lm_scores=None, ngram_lm_state=None, tokens=None, last_token=None, token_duration=None, last_frame=164)
```
But I don't know how to align the text output with the timesteps. How can I do that?
@Nguyen667201 What decoding strategy are you using?
Please note that word-level KenLM models are not supported for TDT.
You can use token-level language models:
- with greedy decoding (`greedy_batch` strategy; a configuration sketch follows below), see the description in https://github.com/NVIDIA/NeMo/pull/10989
- with beam search (`malsd_batch` strategy), see the description in https://github.com/NVIDIA/NeMo/pull/12729
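For reference, enabling the token-level LM for greedy decoding could look roughly like the sketch below. The field names under `decoding_cfg.greedy` (`ngram_lm_model`, `ngram_lm_alpha`) are assumptions based on PR #10989 and may differ between NeMo versions, so please check the PR description:

```python
# Sketch: greedy_batch decoding with a token-level KenLM model.
# The ngram_lm_* field names are assumed from PR #10989; verify them
# against your NeMo version before use.
from omegaconf import open_dict
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "greedy_batch"
    decoding_cfg.greedy.ngram_lm_model = "token_lm.kenlm"  # hypothetical path to a token-level KenLM binary
    decoding_cfg.greedy.ngram_lm_alpha = 0.3               # LM weight, worth tuning on a dev set
asr_model.change_decoding_strategy(decoding_cfg)
```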
For greedy decoding (with or without LM), you can get timestamps:
```python
hypotheses = asr_model.transcribe(
    audio=["6313-76958-0000.wav"],
    timestamps=True,
)
hyp = hypotheses[0]
print("Text:", hyp.text)
print("Tokens:", hyp.y_sequence)
print("Token-level timestamps:", hyp.timestamp["timestep"])
print("Char-level timestamps:", hyp.timestamp["char"][:5])  # output is too long, print only the first 5 elements
print("Word-level timestamps:", hyp.timestamp["word"])
print("Segment-level timestamps:", hyp.timestamp["segment"])
```
Output:
```
Text: CHAPTER four THE FIRST NIGHT IN CAMP
Tokens: tensor([140, 859, 847, 863, 846, 871, 870, 801, 65, 859, 871, 237, 845, 870,
        848, 846, 212, 845, 868, 859, 846, 34, 867, 140, 847, 858, 863])
Token-level timestamps: [2, 3, 4, 5, 5, 5, 5, 7, 12, 13, 13, 15, 16, 17, 18, 19, 21, 22, 23, 23, 24, 25, 26, 28, 29, 30, 31]
Char-level timestamps: [{'char': ['C'], 'start_offset': 2, 'end_offset': 3, 'start': 0.16, 'end': 0.24}, {'char': ['H'], 'start_offset': 3, 'end_offset': 4, 'start': 0.24, 'end': 0.32}, {'char': ['A'], 'start_offset': 4, 'end_offset': 5, 'start': 0.32, 'end': 0.4}, {'char': ['P'], 'start_offset': 5, 'end_offset': 5, 'start': 0.4, 'end': 0.4}, {'char': ['T'], 'start_offset': 5, 'end_offset': 5, 'start': 0.4, 'end': 0.4}]
Word-level timestamps: [{'word': 'CHAPTER', 'start_offset': 2, 'end_offset': 5, 'start': 0.16, 'end': 0.4}, {'word': 'four', 'start_offset': 7, 'end_offset': 9, 'start': 0.56, 'end': 0.72}, {'word': 'THE', 'start_offset': 12, 'end_offset': 14, 'start': 0.96, 'end': 1.12}, {'word': 'FIRST', 'start_offset': 15, 'end_offset': 20, 'start': 1.2, 'end': 1.6}, {'word': 'NIGHT', 'start_offset': 21, 'end_offset': 24, 'start': 1.68, 'end': 1.92}, {'word': 'IN', 'start_offset': 25, 'end_offset': 27, 'start': 2.0, 'end': 2.16}, {'word': 'CAMP', 'start_offset': 28, 'end_offset': 32, 'start': 2.24, 'end': 2.56}]
Segment-level timestamps: [{'segment': 'CHAPTER four THE FIRST NIGHT IN CAMP', 'start_offset': 2, 'end_offset': 32, 'start': 0.16, 'end': 2.56}]
```
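Based on the dictionary fields shown in this output, you can turn the word-level entries into a simple time-aligned listing, for example:

```python
# Sketch: print a time-aligned word list from the word-level timestamps above.
for w in hyp.timestamp["word"]:
    print(f"{w['start']:5.2f}s - {w['end']:5.2f}s  {w['word']}")
```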
For beam search, timestamps are not available yet.
Hi @artbataev, I'm using `decoding_cfg.strategy = "beam"` for inference, but I'm getting timesteps like `timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156]` in the output. What do they mean? Can I use them to map to the word level, or should I use the word-level output from the greedy strategy to align with the KenLM output?
@Nguyen667201
As far as I can see in the code, the "beam" strategy (for both RNN-T and TDT) ignores n-gram LM models, so you are actually using beam search without an LM.
https://github.com/NVIDIA/NeMo/blob/v2.3.1/nemo/collections/asr/parts/submodules/rnnt_decoding.py#L394
We will add an error check in the future to ensure that an n-gram LM is not passed in this case.
If you need beam search with TDT + LM, you can use:
- "maes" strategy (available in v2.3.1) https://github.com/NVIDIA/NeMo/blob/v2.3.1/nemo/collections/asr/parts/submodules/rnnt_decoding.py#L446
- "malsd_batch" strategy (recommended, many times faster; see PR link above, but it is available only in the main branch and will be part of the next release)
Regarding the timestamps that you see in the hypothesis:
```
y_sequence=[1024, 223, 224, 5, 709, 50, 9, 172, 309, 64, 5, 168, 167, 840, 822, 239, 839, 5, 147, 840, 821, 59, 819, 862, 882, 15, 131, 55, 229, 131, 55, 39, 148, 4, 826, 30, 104, 326, 841]
timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156]

len(y_sequence) == 39
len(timestep) == 38
```
The first symbol in `y_sequence` is `1024`, which is the blank symbol; you can ignore it. The other symbols (starting from `223`) are decoded BPE tokens. You can use `asr_model.tokenizer` (`.ids_to_tokens(...)` or `.ids_to_text(...)`) to decode them and get the actual transcript. `timestep` is an array with the frame numbers where the tokens were emitted (without the first blank token), which is why it has one element fewer than `y_sequence`.
Given 80 ms per frame (10 ms feature step, 8x encoder subsampling), you can convert each frame index to seconds from the start of the utterance (e.g., 0.08 * 8 = 0.64 s for the first token 223, 0.08 * 15 = 1.2 s for the next token 224, and so on).
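As a concrete illustration, here is a minimal sketch that pairs each non-blank token with its emission time in seconds, assuming `hyp` is the beam-search `Hypothesis` shown above:

```python
# Minimal sketch: align each non-blank token with its time in seconds.
# hyp.y_sequence[0] is the blank token (id 1024), so it is dropped to
# match the 38-element timestep list.
FRAME_SEC = 0.08  # 10 ms feature step * 8x encoder subsampling

token_ids = hyp.y_sequence.tolist()[1:]  # drop the leading blank
tokens = asr_model.tokenizer.ids_to_tokens(token_ids)
for tok, frame in zip(tokens, hyp.timestep):
    print(f"{frame * FRAME_SEC:6.2f}s  {tok}")
```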