Han Yang

yaanhaan

AI & ML interests

None yet

Recent Activity

commented on an article 6 days ago
Train a Llama model from scratch
liked a dataset 25 days ago
hf-doc-build/doc-build
updated a model 9 months ago
yaanhaan/codeparrot-ds

Organizations

None yet

yaanhaan's activity

commented on Train a Llama model from scratch 6 days ago

Dear Roggendorff,

Thank you for sharing :)

I got the error TypeError: Input must be a List[Union[str, AddedToken]] when I called add_special_tokens.

I noticed that the tokenizer is an instance of ByteLevelBPETokenizer, and its add_special_tokens method expects an argument of type List[Union[str, AddedToken]] (link). Is this an API change, or should we instead use transformers.LlamaTokenizer directly, which inherits from PreTrainedTokenizer?
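
For reference, here is a minimal sketch of the list-based call on the tokenizers side (the tokenizer instance and the token list are just placeholders, not the article's code):

from tokenizers import ByteLevelBPETokenizer

# The tokenizers-library method takes a plain list of strings or AddedToken
# objects, not the dict of named roles that transformers tokenizers accept.
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.add_special_tokens(["<s>", "</s>", "<unk>", "<pad>", "<mask>"])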

Specifically, this is what I have now:

from transformers import AutoTokenizer

# training_corpus is assumed to be defined earlier, as in the article:
# an iterator yielding batches of training text.
old_tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 32000)

# The transformers method takes a dict mapping special-token roles to tokens
special_tokens = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<|user|>", "<|bot|>", "<|end|>"]  # same here
}
new_tokenizer.add_special_tokens(special_tokens)
new_tokenizer.save_pretrained("./tokenizer")
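
To double-check the resulting class (assuming the snippet above has run):

print(type(new_tokenizer))
# <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>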

The type of new_tokenizer is transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast. Is this correct?

Thanks,
Han

New activity in Helsinki-NLP/opus-mt-de-en about 2 years ago

Nein -> Yes

#2 opened about 2 years ago by yaanhaan