Error with Testing

#1
by Owos - opened

Testing this model directly from my computer gives this error. Is there a way we can fix it?

Can't load tokenizer using from_pretrained, please update its configuration: Can't load tokenizer for 'sanchit-gandhi/whisper-small-hi'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'sanchit-gandhi/whisper-small-hi' is the correct path to a directory containing all relevant files for a WhisperTokenizer tokenizer.

Thanks for flagging! The model was missing the tokenizer files. Resolved with https://huggingface.co/sanchit-gandhi/whisper-small-hi/commit/e4e67782fdea6089f7939884af74e7d735f79b00 It should work now!

sanchit-gandhi changed discussion status to closed

Wow, that looks like a lot of changes. Is there a way I can do this automatically from the notebook (I'm using the step-by-step process in the Colab notebook)?
And just a follow-up: the notebook did not explain how to change the language to something else. I just changed "Hindi" to "English" and "hi" to "en". Is that correct?

Owos changed discussion status to open

Yep! The simplest way is to save the processor object before training. I've updated the Google Colab:

https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb

See section "Define the Training Configuration".
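For reference, the pattern boils down to saving the processor into the training output directory before training starts, so the tokenizer files get uploaded with the first push. A sketch of that notebook cell (the checkpoint, language, and output-directory names are the ones used in this thread; other training arguments are elided):

```python
from transformers import WhisperProcessor, Seq2SeqTrainingArguments

# Load the processor (feature extractor + tokenizer) for the base checkpoint
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # local dir that doubles as the Hub repo name
    push_to_hub=True,
    # ... remaining training arguments ...
)

# Save the processor (and with it the tokenizer files) into the output
# directory *before* training, so they are included in the first Hub push.
processor.save_pretrained(training_args.output_dir)
```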

That's correct! Just switch the language in the processor and dataset:

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="English", task="transcribe") 

Out of interest, are you training a model for English speech recognition? If so, you can use the small.en checkpoint, and omit the language and task args from the processor.

Hi, I just checked the notebook you linked. It is the same as the old one. If you are referring to the cell with this code: processor.save_pretrained(training_args.output_dir), it was in the previous version too.
I added this cell after trainer.push_to_hub(**kwargs) to push the tokenizer files to the Hub automatically: processor.push_to_hub(repo_id='sanchit-gandhi/whisper-small-hi', commit_message='added tokenizer', use_auth_token='xxxxxx')
You could add it to the notebook for anyone who runs into the same issue.
I will close the issue now.
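For anyone landing here after training without the tokenizer files, the workaround above amounts to pushing the processor to the existing model repo after the fact. A sketch (the repo id matches this thread; the token is a placeholder you must replace):

```python
from transformers import WhisperProcessor

# Reload the processor that was used during training
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)

# Push the processor (tokenizer + feature extractor files) to the model repo.
# repo_id and the auth token are placeholders -- substitute your own values.
processor.push_to_hub(
    repo_id="sanchit-gandhi/whisper-small-hi",
    commit_message="added tokenizer",
    use_auth_token="xxxxxx",
)
```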

Owos changed discussion status to closed

Thanks @Owos for the fix 🙌
