How can I align timesteps to text for Parakeet-tdt-0.6b-v2 output using KenLM?

#48 opened by Nguyen667201

Thanks to the NeMo team for the SOTA model! When I tried to run inference on the Parakeet ASR model with KenLM, I got an output like this:

Hypothesis(score=-4.529468536376953, y_sequence=tensor([1024, 223, 224, 5, 709, 50, 9, 172, 309, 64, 5, 168,
167, 840, 822, 239, 839, 5, 147, 840, 821, 59, 819, 862,
882, 15, 131, 55, 229, 131, 55, 39, 148, 4, 826, 30,
104, 326, 841]), text='because by the final square on the chessboard, the debt is 18 billion trillion grains of rice.', dec_out=None, dec_state=(tensor([[[ 4.2064e-02, 1.4051e-02, -5.5140e-01, ..., -2.0446e-01,
-1.0781e-04, -5.2153e-05]],

    [[-1.3940e-02,  4.0917e-02,  5.9655e-02,  ...,  8.7109e-02,
      -4.7848e-03,  5.0823e-02]]]), tensor([[[ 0.9999,  0.0141, -0.6204,  ..., -0.3344, -0.0409, -0.0473]],

    [[-0.0511,  0.1249,  0.2977,  ...,  1.0271, -2.0819,  0.1519]]])), timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156], alignments=None, frame_confidence=None, token_confidence=None, word_confidence=None, length=tensor(164, device='cuda:0'), y=None, lm_state=None, lm_scores=None, ngram_lm_state=None, tokens=None, last_token=None, token_duration=None, last_frame=164)

But I don't know how to align the text output with these timesteps. How can I do that?

@Nguyen667201 What decoding strategy are you using?

Please note that word-level KenLM models are not supported for TDT.
You can use token-level language models:
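
For example, here is a minimal sketch of how a token-level n-gram LM might be plugged into greedy decoding. The config field names (greedy.ngram_lm_model, greedy.ngram_lm_alpha) and the LM file path are assumptions based on recent NeMo versions that added n-gram LM fusion for greedy RNNT/TDT decoding; please check the decoding config of the NeMo version you are running. asr_model here is the Parakeet model loaded as in the full example below.

from omegaconf import open_dict

# Assumed field names for n-gram LM fusion in greedy decoding; verify them
# against your NeMo version before relying on them.
decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "greedy_batch"
    decoding_cfg.greedy.ngram_lm_model = "token_level_lm.nemo"  # hypothetical path to a token-level LM
    decoding_cfg.greedy.ngram_lm_alpha = 0.3  # LM weight, tune on a dev set
asr_model.change_decoding_strategy(decoding_cfg)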

For greedy decoding (with or without an LM), you can get timestamps:

import nemo.collections.asr as nemo_asr

# Load the model and transcribe with timestamps enabled (greedy decoding by default)
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

hypotheses = asr_model.transcribe(
    audio=["6313-76958-0000.wav"],
    timestamps=True,
)
hyp = hypotheses[0]

print("Text:", hyp.text)
print("Tokens:", hyp.y_sequence)
print("Token-level timestamps:", hyp.timestamp["timestep"])
print("Char-level timestamps:", hyp.timestamp["char"][:5])  # output is too long, print only first 5 elements
print("Word-level timestamps:", hyp.timestamp["word"])
print("Segment-level timestamps:", hyp.timestamp["segment"])

Output:

Text: CHAPTER four THE FIRST NIGHT IN CAMP
Tokens: tensor([140, 859, 847, 863, 846, 871, 870, 801,  65, 859, 871, 237, 845, 870,
        848, 846, 212, 845, 868, 859, 846,  34, 867, 140, 847, 858, 863])
Token-level timestamps: [2, 3, 4, 5, 5, 5, 5, 7, 12, 13, 13, 15, 16, 17, 18, 19, 21, 22, 23, 23, 24, 25, 26, 28, 29, 30, 31]
Char-level timestamps: [{'char': ['C'], 'start_offset': 2, 'end_offset': 3, 'start': 0.16, 'end': 0.24}, {'char': ['H'], 'start_offset': 3, 'end_offset': 4, 'start': 0.24, 'end': 0.32}, {'char': ['A'], 'start_offset': 4, 'end_offset': 5, 'start': 0.32, 'end': 0.4}, {'char': ['P'], 'start_offset': 5, 'end_offset': 5, 'start': 0.4, 'end': 0.4}, {'char': ['T'], 'start_offset': 5, 'end_offset': 5, 'start': 0.4, 'end': 0.4}]
Word-level timestamps: [{'word': 'CHAPTER', 'start_offset': 2, 'end_offset': 5, 'start': 0.16, 'end': 0.4}, {'word': 'four', 'start_offset': 7, 'end_offset': 9, 'start': 0.56, 'end': 0.72}, {'word': 'THE', 'start_offset': 12, 'end_offset': 14, 'start': 0.96, 'end': 1.12}, {'word': 'FIRST', 'start_offset': 15, 'end_offset': 20, 'start': 1.2, 'end': 1.6}, {'word': 'NIGHT', 'start_offset': 21, 'end_offset': 24, 'start': 1.68, 'end': 1.92}, {'word': 'IN', 'start_offset': 25, 'end_offset': 27, 'start': 2.0, 'end': 2.16}, {'word': 'CAMP', 'start_offset': 28, 'end_offset': 32, 'start': 2.24, 'end': 2.56}]
Segment-level timestamps: [{'segment': 'CHAPTER four THE FIRST NIGHT IN CAMP', 'start_offset': 2, 'end_offset': 32, 'start': 0.16, 'end': 2.56}]
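
The start/end values are simply the offsets converted to seconds. For this model one encoder frame corresponds to 0.08 s (0.01 s feature window stride times the 8x subsampling of the standard FastConformer setup), which matches the output above: offset 2 → 0.16 s. A minimal sketch of the conversion, assuming that setup:

# Convert frame offsets to seconds (assumes window_stride = 0.01 s and 8x encoder subsampling)
frame_shift = asr_model.cfg.preprocessor.window_stride * 8  # 0.08 s per encoder frame

for w in hyp.timestamp["word"]:
    print(w["word"], w["start_offset"] * frame_shift, w["end_offset"] * frame_shift)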

For beam search, timestamps are not available yet.

Hi @artbataev , I'm using decoding_cfg.strategy = "beam" for inference, but I'm getting timesteps like timestep=[8, 15, 19, 23, 27, 59, 62, 65, 67, 69, 71, 74, 75, 77, 79, 81, 87, 88, 90, 93, 97, 99, 101, 104, 117, 118, 119, 125, 126, 128, 133, 135, 136, 138, 140, 144, 148, 156] in the output. What do these values mean? Can I use them to map the output to the word level, or should I use the word-level output from the greedy strategy to align with the KenLM output?
