How can I align timesteps to text for Parakeet-tdt-0.6b-v2 output using KenLM?
Thanks to the NeMo team for the SOTA model! When I tried to run inference on the Parakeet ASR model with KenLM, I got an output like this:
```
Hypothesis(score=-4.529468536376953, y_sequence=tensor([1024, 223, 224, 5, 709, 50, 9, 172, 309, 64, 5, 168,
        167, 840, 822, 239, 839, 5, 147, 840, 821, 59, 819, 862,
        882, 15, 131, 55, 229, 131, 55, 39, 148, 4, 826, 30,
        104, 326, 841]), text='because by the final square on the chessboard, the debt is 18 billion trillion grains of rice.', dec_out=None, dec_state=(tensor([[[ 4.2064e-02, 1.4051e-02, -5.5140e-01, ..., -2.0446e-01,
        -1.0781e-04, -5.2153e-05]],
        [[-1.3940e-02, 4.0917e-02, 5.9655e-02, ..., 8.7109e-02,
        -4.7848e-03, 5.0823e-02]]]), tensor([[[ 0.9999, 0.0141, -0.6204, ..., -0.3344, -0.0409, -0.0473]],
        [[-0.0511, 0.1249, 0.2977, ..., 1.0271, -2.0819, 0.1519]]])), timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156], alignments=None, frame_confidence=None, token_confidence=None, word_confidence=None, length=tensor(164, device='cuda:0'), y=None, lm_state=None, lm_scores=None, ngram_lm_state=None, tokens=None, last_token=None, token_duration=None, last_frame=164)
```
But I don't know how to align the text output with the timesteps. How can I do that?
@Nguyen667201 What decoding strategy are you using?
Please note that word-level KenLM models are not supported for TDT.
You can use token-level language models:
- with greedy decoding (`greedy_batch` strategy; a configuration sketch follows below), see the description in https://github.com/NVIDIA/NeMo/pull/10989
- with beam search (`malsd_batch` strategy), see the description in https://github.com/NVIDIA/NeMo/pull/12729
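For reference, enabling the token-level LM for greedy decoding could look roughly like the sketch below. The field names under `decoding_cfg.greedy` (`ngram_lm_model`, `ngram_lm_alpha`) are assumptions based on PR #10989 and may differ between NeMo versions, so please check the PR description:

```python
# Sketch: greedy_batch decoding with a token-level KenLM model.
# The ngram_lm_* field names are assumed from PR #10989; verify them
# against your NeMo version before use.
from omegaconf import open_dict
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "greedy_batch"
    decoding_cfg.greedy.ngram_lm_model = "token_lm.kenlm"  # hypothetical path to a token-level KenLM binary
    decoding_cfg.greedy.ngram_lm_alpha = 0.3               # LM weight, worth tuning on a dev set
asr_model.change_decoding_strategy(decoding_cfg)
```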
For greedy decoding (with or without LM), you can get timestamps:
```python
hypotheses = asr_model.transcribe(
    audio=["6313-76958-0000.wav"],
    timestamps=True,
)
hyp = hypotheses[0]
print("Text:", hyp.text)
print("Tokens:", hyp.y_sequence)
print("Token-level timestamps:", hyp.timestamp["timestep"])
print("Char-level timestamps:", hyp.timestamp["char"][:5])  # output is too long, print only the first 5 elements
print("Word-level timestamps:", hyp.timestamp["word"])
print("Segment-level timestamps:", hyp.timestamp["segment"])
```
Output:
```
Text: CHAPTER four THE FIRST NIGHT IN CAMP
Tokens: tensor([140, 859, 847, 863, 846, 871, 870, 801, 65, 859, 871, 237, 845, 870,
        848, 846, 212, 845, 868, 859, 846, 34, 867, 140, 847, 858, 863])
Token-level timestamps: [2, 3, 4, 5, 5, 5, 5, 7, 12, 13, 13, 15, 16, 17, 18, 19, 21, 22, 23, 23, 24, 25, 26, 28, 29, 30, 31]
Char-level timestamps: [{'char': ['C'], 'start_offset': 2, 'end_offset': 3, 'start': 0.16, 'end': 0.24}, {'char': ['H'], 'start_offset': 3, 'end_offset': 4, 'start': 0.24, 'end': 0.32}, {'char': ['A'], 'start_offset': 4, 'end_offset': 5, 'start': 0.32, 'end': 0.4}, {'char': ['P'], 'start_offset': 5, 'end_offset': 5, 'start': 0.4, 'end': 0.4}, {'char': ['T'], 'start_offset': 5, 'end_offset': 5, 'start': 0.4, 'end': 0.4}]
Word-level timestamps: [{'word': 'CHAPTER', 'start_offset': 2, 'end_offset': 5, 'start': 0.16, 'end': 0.4}, {'word': 'four', 'start_offset': 7, 'end_offset': 9, 'start': 0.56, 'end': 0.72}, {'word': 'THE', 'start_offset': 12, 'end_offset': 14, 'start': 0.96, 'end': 1.12}, {'word': 'FIRST', 'start_offset': 15, 'end_offset': 20, 'start': 1.2, 'end': 1.6}, {'word': 'NIGHT', 'start_offset': 21, 'end_offset': 24, 'start': 1.68, 'end': 1.92}, {'word': 'IN', 'start_offset': 25, 'end_offset': 27, 'start': 2.0, 'end': 2.16}, {'word': 'CAMP', 'start_offset': 28, 'end_offset': 32, 'start': 2.24, 'end': 2.56}]
Segment-level timestamps: [{'segment': 'CHAPTER four THE FIRST NIGHT IN CAMP', 'start_offset': 2, 'end_offset': 32, 'start': 0.16, 'end': 2.56}]
```
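Based on the dictionary fields shown in this output, you can turn the word-level entries into a simple time-aligned listing, for example:

```python
# Sketch: print a time-aligned word list from the word-level timestamps above.
for w in hyp.timestamp["word"]:
    print(f"{w['start']:5.2f}s - {w['end']:5.2f}s  {w['word']}")
```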
For beam search, timestamps are not available yet.
Hi @artbataev, I'm using `decoding_cfg.strategy = "beam"` for inference, but I'm getting timesteps like `timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156]` in the output. What do they mean? Can I use them to map to the word level, or should I use the word-level output from the greedy strategy to align with the KenLM output?
@Nguyen667201
As far as I can see in the code, the "beam" strategy (for both RNN-T and TDT) ignores n-gram LM models, so you are actually using beam search without an LM.
https://github.com/NVIDIA/NeMo/blob/v2.3.1/nemo/collections/asr/parts/submodules/rnnt_decoding.py#L394
We will add an error check in the future to ensure that an n-gram LM is not passed in this case.
If you need beam search with TDT + LM, you can use:
- "maes" strategy (available in v2.3.1) https://github.com/NVIDIA/NeMo/blob/v2.3.1/nemo/collections/asr/parts/submodules/rnnt_decoding.py#L446
- "malsd_batch" strategy (recommended, many times faster; see PR link above, but it is available only in the main branch and will be part of the next release)
Regarding the timestamps that you see in the hypothesis:
```
y_sequence=[1024, 223, 224, 5, 709, 50, 9, 172, 309, 64, 5, 168, 167, 840, 822, 239, 839, 5, 147, 840, 821, 59, 819, 862, 882, 15, 131, 55, 229, 131, 55, 39, 148, 4, 826, 30, 104, 326, 841]
timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156]

len(y_sequence) == 39
len(timestep) == 38
```
The first symbol in `y_sequence` is `1024`, which is the blank symbol; you can ignore it. The other symbols (starting from `223`) are decoded BPE tokens. You can use `asr_model.tokenizer` (`.ids_to_tokens(...)` or `.ids_to_text(...)`) to decode them and get the actual transcript. `timestep` is an array with the frame numbers where the tokens were emitted (without the first blank token), which is why it has one element fewer than `y_sequence`.
Given 80 ms per frame (10 ms feature step, 8x encoder subsampling), you can convert each frame index to seconds from the start of the utterance (e.g., 0.08 * 8 = 0.64 s for the first token 223, 0.08 * 15 = 1.2 s for the next token 224, and so on).
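As a concrete illustration, here is a minimal sketch that pairs each non-blank token with its emission time in seconds, assuming `hyp` is the beam-search `Hypothesis` shown above:

```python
# Minimal sketch: align each non-blank token with its time in seconds.
# hyp.y_sequence[0] is the blank token (id 1024), so it is dropped to
# match the 38-element timestep list.
FRAME_SEC = 0.08  # 10 ms feature step * 8x encoder subsampling

token_ids = hyp.y_sequence.tolist()[1:]  # drop the leading blank
tokens = asr_model.tokenizer.ids_to_tokens(token_ids)
for tok, frame in zip(tokens, hyp.timestep):
    print(f"{frame * FRAME_SEC:6.2f}s  {tok}")
```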