Discrepancy in Special Tokens Between phi-3-mini and Llama-2 Tokenizer

#97
by sootung - opened

Hello,

I noticed a discrepancy between the special tokens used in the phi-3-mini/phi-3-mini-128k tokenizer and what is described in the technical report (https://arxiv.org/pdf/2404.14219). According to the report, phi-3-mini/phi-3-mini-128k and Llama-2 should use the same tokenizer. However, I observed that the special_tokens_map.json file in the Hugging Face repository for the phi-3-mini model contains an eos_token ("<|endoftext|>") that differs from the one used in Llama-2 ("</s>").
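For reference, here is a minimal sketch of the comparison, with the eos_token values hard-coded as I observed them in each repository's special_tokens_map.json (only the relevant field is shown):

```python
import json

# Excerpts of special_tokens_map.json as observed in each Hugging Face repo;
# only the eos_token entry is reproduced here.
phi3_map = json.loads('{"eos_token": "<|endoftext|>"}')
llama2_map = json.loads('{"eos_token": "</s>"}')

# The two eos_token values do not match.
print(phi3_map["eos_token"])    # <|endoftext|>
print(llama2_map["eos_token"])  # </s>
print(phi3_map["eos_token"] == llama2_map["eos_token"])  # False
```

The same check can of course be run against the full files downloaded from the respective model repositories.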

Could you please clarify why there is a difference between the eos_token in phi-3-mini and the one used in Llama-2, and whether this was intentional or an oversight?

Thank you for your help!

Best regards.
