How to use SentencePiece tokenizer with the repo
#4 opened by AndreiAksionov
Hey there.
I have a question: the repo contains only one tokenizer file, tokenizer.model. As I understand it, this is a SentencePiece model file.
The problem is that the instruct variant defines additional tokens in added_tokens.json, and extending an already-pretrained SentencePiece tokenizer with new tokens can be tricky.
I know that AutoTokenizer handles this, but what if I want to use the SentencePiece tokenizer directly, since its model file is right there in the repo?
Or am I digging too deep and there is a simpler way to use a tokenizer with this model?