Error with Testing

by Owos - opened Nov 16, 2022

Owos

Nov 16, 2022

Testing this model directly from my computer gives the error below. Is there a way we can fix this?
Can't load tokenizer using from_pretrained, please update its configuration: Can't load tokenizer for 'sanchit-gandhi/whisper-small-hi'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'sanchit-gandhi/whisper-small-hi' is the correct path to a directory containing all relevant files for a WhisperTokenizer tokenizer.

sanchit-gandhi

Owner Nov 17, 2022

Thanks for flagging! The model was missing the tokenizer files. Resolved with https://huggingface.co/sanchit-gandhi/whisper-small-hi/commit/e4e67782fdea6089f7939884af74e7d735f79b00 It should work now!
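
For anyone hitting the same error on their own repo, here is a rough sketch of how one might check whether the tokenizer files made it to the Hub. The list of expected file names is an assumption (files a slow WhisperTokenizer usually saves), and `missing_tokenizer_files` and `check_repo` are hypothetical helper names, not part of any library.

```python
from typing import Iterable, List, Sequence

# Assumption: file names a slow WhisperTokenizer typically writes on save.
# Adjust this tuple if your tokenizer saves a different set of files.
EXPECTED_TOKENIZER_FILES = ("vocab.json", "merges.txt", "tokenizer_config.json")


def missing_tokenizer_files(
    repo_files: Iterable[str],
    expected: Sequence[str] = EXPECTED_TOKENIZER_FILES,
) -> List[str]:
    """Return the expected tokenizer files that are absent from repo_files."""
    present = set(repo_files)
    return [name for name in expected if name not in present]


def check_repo(repo_id: str) -> List[str]:
    """List the files of a Hub repo and report missing tokenizer files."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import list_repo_files

    return missing_tokenizer_files(list_repo_files(repo_id))
```

Calling `check_repo("your-username/your-model")` needs network access. An empty list would suggest the tokenizer files are present.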

sanchit-gandhi changed discussion status to closed Nov 17, 2022

Owos

Nov 17, 2022

edited Nov 17, 2022

Wow, that looks like a lot of changes. Is there a way I can do this automatically from the notebook (I'm following the step-by-step process in the Colab notebook)?
And just a follow-up: the notebook did not indicate how to change the language to something else. I just changed 'hindi' to 'english' and 'hi' to 'en'. Is that correct?

Owos changed discussion status to open Nov 17, 2022

sanchit-gandhi

Owner Nov 17, 2022

edited Nov 17, 2022

Yep! The simplest way is to save the processor object before training. I've updated the Google Colab:

https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb

See section "Define the Training Configuration".

That's correct! Just switch the language in the processor and dataset:

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="English", task="transcribe")

Out of interest, are you training a model for English speech recognition? If so, you can use the small.en checkpoint, and omit the language and task args from the processor.
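
A minimal sketch of the two options above, assuming the small checkpoints named in this thread. `processor_spec` and `load_processor` are hypothetical helper names used only for illustration; the checkpoint id "openai/whisper-small.en" is the Hub name assumed for the small.en checkpoint.

```python
from typing import Any, Dict, Tuple


def processor_spec(language: str) -> Tuple[str, Dict[str, Any]]:
    """Pick a checkpoint and the from_pretrained kwargs for a target language."""
    if language.strip().lower() == "english":
        # English-only checkpoint: no language/task arguments needed.
        return "openai/whisper-small.en", {}
    # Multilingual checkpoint: set the language and task explicitly.
    return "openai/whisper-small", {"language": language, "task": "transcribe"}


def load_processor(language: str):
    """Load a WhisperProcessor for the given language (downloads from the Hub)."""
    # Imported lazily so processor_spec can be used without transformers installed.
    from transformers import WhisperProcessor

    checkpoint, kwargs = processor_spec(language)
    return WhisperProcessor.from_pretrained(checkpoint, **kwargs)
```

For example, `load_processor("Hindi")` would load the multilingual processor as in the notebook. Remember that the dataset language code ('hi', 'en', ...) has to be switched separately when loading the dataset.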

Owos

Nov 19, 2022

Hi, I just checked the notebook you linked. It is the same as the old one. If you are referring to the cell with this code: processor.save_pretrained(training_args.output_dir), it was in the previous version too.
I added this cell after trainer.push_to_hub(**kwargs) to automatically push the tokenizer files to the Hub: processor.push_to_hub(repo_id='sanchit-gandhi/whisper-small-hi', commit_message = 'added tokiner', use_auth_token='xxxxxx')
You can add it to the notebook for people who run into the same issue.
I will close the issue now.
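
For reference, the extra cell described above could be wrapped up roughly like this. It is only a sketch: `save_and_push_processor` is a hypothetical helper name, and it assumes you are already logged in to the Hub (for example via `notebook_login()`), so no token is passed explicitly. Depending on the transformers version, a token can be passed as `use_auth_token=` (as in the cell above) or `token=` in newer releases.

```python
def save_and_push_processor(
    processor,
    output_dir: str,
    repo_id: str,
    commit_message: str = "Add processor files",
) -> None:
    """Save the processor locally, then push its files to a Hub repo."""
    # Writes the tokenizer and feature extractor files next to the checkpoints.
    processor.save_pretrained(output_dir)
    # Uploads the same files to the model repo on the Hub.
    processor.push_to_hub(repo_id, commit_message=commit_message)


# Usage in the notebook, after trainer.push_to_hub(**kwargs), might look like:
# save_and_push_processor(
#     processor,
#     training_args.output_dir,
#     "your-username/whisper-small-hi",  # placeholder repo id
# )
```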

Owos changed discussion status to closed Nov 19, 2022

sanchit-gandhi

Owner Nov 21, 2022

Thanks @Owos for the fix 🙌
