tokenizer: setting eos_token in tokenizer_config.json not working

#2
by g-ronimo - opened

the eos_token set in tokenizer_config.json is not respected when loading the tokenizer

from transformers import AutoTokenizer

modelpath = "stabilityai/stablelm-2-zephyr-1_6b"  # here: a finetune of this model

tokenizer = AutoTokenizer.from_pretrained(
    modelpath,
    trust_remote_code=True,
    use_fast=False,
)
tokenizer.eos_token

output: '<|endoftext|>'

tokenizer_config.json:

{
  "added_tokens_decoder": {},
  "auto_map": {
    "AutoTokenizer": [
      "tokenization_arcade100k.Arcade100kTokenizer",
      null
    ]
  },
  "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": true,
  "errors": "replace",
  "model_max_length": 2048,
  "pad_token": "<|endoftext|>",
  "eos_token": "<|im_end|>",
  "tokenizer_class": "Arcade100kTokenizer"
}

setting eos_token to something random raises an error when loading the tokenizer, so the value is processed somewhere, but tokenizer.eos_token is always '<|endoftext|>', no matter what I set (as long as the token exists in the vocab)
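
a workaround that seems to hold for now is overriding the attribute after loading; a minimal sketch, assuming <|im_end|> already exists in the vocab:

# workaround sketch: override the hardcoded value after the tokenizer is loaded
tokenizer.eos_token = "<|im_end|>"
print(tokenizer.eos_token, tokenizer.eos_token_id)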

model is stabilityai/stablelm-2-zephyr-1_6b, finetuned on ChatML conversations
thank you for this great model!

Stability AI org

Hi, @g-ronimo ! Thanks for bringing this up! The eos_token was hardcoded in the tokenizer constructor. It's been updated to allow for overridability. Let us know if you run into any further issues πŸ™
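
For reference, a minimal sketch of the pattern behind the fix (illustrative only, not the actual tokenization_arcade100k.py code):

from transformers import PreTrainedTokenizer

class Arcade100kTokenizerSketch(PreTrainedTokenizer):
    def __init__(self, eos_token="<|endoftext|>", **kwargs):
        # before: the constructor effectively called
        #     super().__init__(eos_token="<|endoftext|>", **kwargs)
        # which ignored any eos_token set in tokenizer_config.json
        # after: accepting eos_token as a keyword and passing it through
        # lets values from the config (or from_pretrained kwargs) take effect
        super().__init__(eos_token=eos_token, **kwargs)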

jon-tow changed discussion status to closed
