Dear Roggendorff,
thank you for sharing :). When I used add_special_tokens, it raised the error TypeError: Input must be a List[Union[str, AddedToken]].
I just noticed that the tokenizer is an instance of ByteLevelBPETokenizer, and its add_special_tokens method expects an argument of type List[Union[str, AddedToken]] (link).
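For reference, a minimal sketch of the list form that ByteLevelBPETokenizer.add_special_tokens seems to expect (this is my reading of the error message, so treat it as an assumption):

```python
from tokenizers import AddedToken, ByteLevelBPETokenizer

tok = ByteLevelBPETokenizer()
# a plain list of strings and/or AddedToken objects is accepted
tok.add_special_tokens(["<s>", "</s>", AddedToken("<pad>")])
# whereas a dict (the transformers-style call) raises
# TypeError: Input must be a List[Union[str, AddedToken]]
```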
Is this an update of the API, or should we directly use transformers.LlamaTokenizer instead, which inherits from PreTrainedTokenizer?
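A quick way to confirm that inheritance is something like this (a sketch; LlamaTokenizer here is the slow, sentencepiece-based class):

```python
from transformers import LlamaTokenizer, PreTrainedTokenizer

# the slow Llama tokenizer subclasses PreTrainedTokenizer,
# which provides the dict-based add_special_tokens
print(issubclass(LlamaTokenizer, PreTrainedTokenizer))  # True
```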
Specifically, my code is:
```python
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)

special_tokens = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<|user|>", "<|bot|>", "<|end|>"],  # same here
}
new_tokenizer.add_special_tokens(special_tokens)
new_tokenizer.save_pretrained("./tokenizer")
```
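A quick way to double-check that the special tokens were actually registered is to print the map afterwards (a small sanity-check sketch):

```python
# inspect the special tokens registered on the new tokenizer
print(new_tokenizer.special_tokens_map)
# expected keys: bos_token, eos_token, unk_token, pad_token,
# mask_token, additional_special_tokens
```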
The type of new_tokenizer is transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast. Is that correct?
Thanks,
Han