Processing dataset?

#6
by Nguyen667201 - opened

Hi nemo team
Congrats on the work, I just want to try training model and I have a question.
with metadata for each data point (i've created by following tutorial):
Ex:
{"audio_filepath": "datasets/LibriLight/librispeech_finetuning/1h/2/clean/5778/12761/5778-12761-0005.flac", "duration": 10.945, "text": "but after our long wandering in rugged mountains where so frequently we had met with disappointments and where the crossing of every ridge displayed some unknown lake or river", "target_lang": "en", "source_lang": "en", "pnc": "False"}

How can I train the model to infer with timestamps?

Nguyen667201 changed discussion status to closed
Nguyen667201 changed discussion status to open

Hi, @Nguyen667201 Thanks for your interest! You need to produce text with alignments as the new target for training. E.g. "text": "<|3|> it's <|7|> <|8|> almost <|9|> <|14|> beyond <|20|> <|20|> conjecture <|28|>", as described in "3. Training on a new task: A case of timestamp prediction". To do that, you can use any aligner (e.g. NeMo force aligner: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tools/nemo_forced_aligner.html) to align between speech and text.

Our code relevant to read text with timestamp is in this PR: https://github.com/NVIDIA/NeMo/pull/11591/files

We will publish a paper soon for more details about training process and data.

Hi @huk-hf thanks for your help, i got it

Hi @huk-hf ,
I have tried training the model with both English and Vietnamese data. However, when I try to infer Vietnamese audio (assuming I set target_lang="fr" and source_lang="fr"), I still get a mix of Vietnamese and English transcription.
Does this mean that target_lang="fr" and source_lang="fr" are not actually being used during inference?
Could you please explain this to me?
image.png

@nithinraok Does the model support Vietnamese?

No, only those four mentioned languages. @Nguyen667201 maybe when you finetuned you probably have not used all combination of prompts and It could have affected the model performance.

NVIDIA org

Hi @Nguyen667201 , thank you for testing Canary models on a new language!

In the screenshot it looks like the Vietnamese audio has been transcribed in Vietnamese (correct me if I am wrong) and the English audio has English transcript. So, the model transcribes English utterance to English text even though both source_lang and target_lang are not English -- is that the unexpected part?

Since the model has not seen any English audio with source_lang=target_lang="fr" during training, it is hard to speculate what will it do on this unseen case during inference. It is possible that Canary model has learned some language ID inherently as a part of it's training and since the model has not been trained on English to Vietnamese translation, it defaults to transcribing in the original language being spoken.

Also a few pointers:

  1. The lang token "fr" is already used for French in Canary's training data. So, I'd recommend using a previously unused token for Vietnamese. You can see a list of available language tokens in the special tokenizer here.
  2. How much Vietnamese ASR data did you use for training? Maybe the amount of training data is not enough for the model to learn the function of source_lang=target_lang="fr".
  3. The way you are using source_lang and target_lang keys looks right to me, i.e. setting source_lang='en' and target_lang='en' for English ASR, and accordingly for any other language. Make sure that you are using source_lang and target_lang in the same way during training and inference.

Sign up or log in to comment