Error with Testing
Testing this model directly from my computer gives this error. Is there a way we can fix it?
Can't load tokenizer using from_pretrained, please update its configuration: Can't load tokenizer for 'sanchit-gandhi/whisper-small-hi'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'sanchit-gandhi/whisper-small-hi' is the correct path to a directory containing all relevant files for a WhisperTokenizer tokenizer.
Thanks for flagging! The model was missing the tokenizer files. Resolved with https://huggingface.co/sanchit-gandhi/whisper-small-hi/commit/e4e67782fdea6089f7939884af74e7d735f79b00. It should work now!
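If you want to double-check from your machine, a minimal load like the following should now run without the error (a quick sketch, assuming a recent transformers install):

```python
from transformers import WhisperTokenizer

# The tokenizer files are now present in the repo, so this load should succeed.
tokenizer = WhisperTokenizer.from_pretrained("sanchit-gandhi/whisper-small-hi")
```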
Wow, that's a lot of changes. Is there a way I can do this automatically from the notebook? (I'm using the step-by-step process in the Colab notebook.)
And just a follow-up: the notebook did not indicate how to change the language to something else. I just changed "Hindi" to "English" and "hi" to "en". Is that correct?
Yep! The simplest way is to save the processor object before training. I've updated the Google Colab:
https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb
See section "Define the Training Configuration".
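In sketch form, the idea is the following (the output_dir value here is just an illustrative example, not the notebook's exact configuration):

```python
from transformers import WhisperProcessor, Seq2SeqTrainingArguments

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)

training_args = Seq2SeqTrainingArguments(output_dir="./whisper-small-hi")

# Saving the processor before training writes the tokenizer and
# feature-extractor files into the output dir, so they get pushed
# to the Hub alongside the model weights.
processor.save_pretrained(training_args.output_dir)
```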
That's correct! Just switch the language in the processor and dataset:
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="English", task="transcribe")
Out of interest, are you training a model for English speech recognition? If so, you can use the small.en
checkpoint, and omit the language and task args from the processor.
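For the English-only checkpoint, that would look roughly like this:

```python
from transformers import WhisperProcessor

# English-only checkpoint: the language and task arguments can be omitted.
processor = WhisperProcessor.from_pretrained("openai/whisper-small.en")
```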
Hi, I just checked the notebook you tagged; it is the same as the old one. If you are referring to the cell with this code: processor.save_pretrained(training_args.output_dir), it was in the former notebook too.
I added this cell after trainer.push_to_hub(**kwargs) to automatically push the tokenizer to the Hub: processor.push_to_hub(repo_id='sanchit-gandhi/whisper-small-hi', commit_message='added tokenizer', use_auth_token='xxxxxx'). You could add it to the notebook for anyone who faces the same issue.
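For reference, as a self-contained cell it looks something like this (the repo id and token are placeholders to replace with your own):

```python
# Assumes the `processor` object created earlier in the notebook is still in scope.
processor.push_to_hub(
    repo_id="sanchit-gandhi/whisper-small-hi",  # replace with your own repo id
    commit_message="added tokenizer",
    use_auth_token="xxxxxx",                    # replace with your HF access token
)
```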
I will close the issue now.
Thanks @Owos for the fix 🙌