Han Yang

yaanhaan

AI & ML interests

None yet

Recent Activity

commented on an article 6 days ago
Train a Llama model from scratch
liked a dataset 25 days ago
hf-doc-build/doc-build
updated a model 9 months ago
yaanhaan/codeparrot-ds

Organizations

None yet

yaanhaan's activity

commented on Train a Llama model from scratch 6 days ago

Dear Roggendorff,

Thank you for sharing :)

I got the error TypeError: Input must be a List[Union[str, AddedToken]] when I called add_special_tokens.

I noticed that the tokenizer is an instance of ByteLevelBPETokenizer, and its add_special_tokens method expects an argument of type List[Union[str, AddedToken]] (link). Is this an API change, or should we instead use transformers.LlamaTokenizer directly, which inherits from PreTrainedTokenizer?
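
For reference, here is a minimal sketch of the list-based call on the tokenizers side (the tokenizer instance and the token list are just placeholders, not the article's code):

from tokenizers import ByteLevelBPETokenizer

# The tokenizers-library method takes a plain list of strings or AddedToken
# objects, not the dict of named roles that transformers tokenizers accept.
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.add_special_tokens(["<s>", "</s>", "<unk>", "<pad>", "<mask>"])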

Specifically, this is what I have now:

from transformers import AutoTokenizer

# training_corpus is assumed to be defined earlier, as in the article:
# an iterator yielding batches of training text.
old_tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)

# The transformers method takes a dict mapping special-token roles to tokens
special_tokens = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<|user|>", "<|bot|>", "<|end|>"]  # same here
}
new_tokenizer.add_special_tokens(special_tokens)
new_tokenizer.save_pretrained("./tokenizer")
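
To double-check the resulting class (assuming the snippet above has run):

print(type(new_tokenizer))
# <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>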

The type of new_tokenizer is transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast. Is this correct?

Thanks,
Han

New activity in Helsinki-NLP/opus-mt-de-en about 2 years ago

Nein -> Yes

#2 opened about 2 years ago by yaanhaan