Discrepancy in Special Tokens Between phi-3-mini and Llama-2 Tokenizer
#97
by
sootung
Hello,
I noticed a discrepancy between the special tokens used in the phi-3-mini/phi-3-mini-128k tokenizer and what is described in the technical report (https://arxiv.org/pdf/2404.14219). According to the report, phi-3-mini/phi-3-mini-128k and Llama-2 should use the same tokenizer. However, I observed that the special_tokens_map.json file in the Hugging Face repository for the phi-3-mini model contains an eos_token ("<|endoftext|>") that differs from the one used in Llama-2 ("</s>").
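For reference, here is a minimal sketch of the comparison, without downloading either model. The dict contents are the eos_token values quoted above, copied from the two repos' special_tokens_map.json files as I observed them, not re-fetched programmatically:

```python
# eos_token entries as observed in each repo's special_tokens_map.json;
# these values are hardcoded here for illustration, not fetched live.
phi3_special_tokens = {"eos_token": "<|endoftext|>"}
llama2_special_tokens = {"eos_token": "</s>"}

def eos_matches(a: dict, b: dict) -> bool:
    """Return True if both special_tokens_map dicts define the same eos_token."""
    return a.get("eos_token") == b.get("eos_token")

print(eos_matches(phi3_special_tokens, llama2_special_tokens))  # False
```

The same check can be run against the live files by loading each tokenizer with `transformers.AutoTokenizer.from_pretrained` and comparing the `eos_token` attributes.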
Could you please clarify why the eos_token in phi-3-mini differs from the one used in Llama-2, and whether this difference was intentional or an oversight?
Thank you for your help!
Best regards.