Convert Hugging Face tokenizer.json to tokenizer.model - inference failed

#4
by kannan678

I am trying to convert this tokenizer.json into tokenizer.model in order to run Karpathy's llama2.c - https://github.com/karpathy/llama2.c/

I tried the following steps:

  1. Extract the vocabulary from tokenizer.json (a sketch of this step follows the list)
  2. Train the sentencepiece tokenizer using spm_train with the extracted vocabulary (vocab_size = 32000). This generates tokenizer.model
  3. Use tokenizer.py to convert the tokenizer.model to tokenizer.bin.
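For step 1, the extraction was roughly the following (a minimal sketch, assuming the BPE-style tokenizer.json layout where the vocab lives under model.vocab; the file names are illustrative):

```python
import json

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]  # token string -> id
with open("vocab.txt", "w", encoding="utf-8") as f:
    for token in sorted(vocab, key=vocab.get):  # write tokens in id order
        f.write(token + "\n")
```

(Step 2 then points spm_train at a text file. Note that spm_train learns its own pieces and scores from the training data, so the resulting model may not reproduce the original tokenizer's vocabulary and merges exactly.)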

Even though the above steps produced a tokenizer.model, inference was not successful. I expected the model to generate a tiny story (since the model is trained on the TinyStories dataset), but the output I got was random gibberish.
I assume this has something to do with the tokenizer.model that was generated.

My question is: can a Hugging Face tokenizer be converted to a tokenizer.model and used with llama2.c? If yes, how can this be done?
If anyone could assist with this, it would be really helpful.

I'm not sure how the conversion works; I have a similar thread going in the reverse direction that was never solved:
https://github.com/karpathy/llama2.c/issues/411

However, since I imported this model from llama2.c, I think you should be able to just use the default tokenizer from the llama2.c repo and I assume it would work fine.

Yes, it works fine with the tokenizer.bin from the git repo https://github.com/nickypro/llama2.c.git. But I need to work with various Hugging Face Llama models and tokenizers, ranging from 15M to 3B. So I wrote a simple Python script to convert the tokenizer.json to tokenizer.bin in the format the llama2.c code expects. I got it working (a rough sketch is below), but then another issue came up.
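The converter does roughly this (a minimal sketch: it assumes a BPE-style tokenizer.json with the vocab under model.vocab and merges under model.merges, approximates sentencepiece scores from merge rank since tokenizer.json does not store per-token scores, and writes the binary layout that llama2.c's tokenizer.py produces):

```python
import json
import struct

with open("tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # token string -> id
merges = tok["model"]["merges"]  # merge pairs in rank order

# Approximate sentencepiece scores: earlier merges get higher scores,
# so the greedy BPE loop in run.c prefers them first.
scores = {}
for rank, merge in enumerate(merges):
    pair = merge.split(" ") if isinstance(merge, str) else merge
    scores["".join(pair)] = -float(rank)

tokens = sorted(vocab, key=vocab.get)  # tokens ordered by id
encoded = []
for t in tokens:
    t = t.replace("\u2581", " ")  # map '▁' to a real space (see the next paragraph)
    encoded.append(t.encode("utf-8"))

# llama2.c's tokenizer.bin layout: a uint32 max-token-length header, then
# per token a float32 score, an int32 byte length, and the raw UTF-8 bytes.
# (llama2.c's tokenizer.py also special-cases the BOS/EOS pieces; omitted here.)
with open("tokenizer.bin", "wb") as f:
    f.write(struct.pack("I", max(len(b) for b in encoded)))
    for t, b in zip(tokens, encoded):
        f.write(struct.pack("fI", scores.get(t, 0.0), len(b)))
        f.write(b)
```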

I read that in Llama, words start with ▁ (U+2581, which looks like an underscore but isn't one), representing a space before the word. This comes from training: the model learned to treat ▁Hello as " Hello", but the tokenizer encodes it as its own piece. So "▁" must be preserved in the vocab file and tokens; otherwise the model will not tokenize inputs correctly or decode outputs to the right words. But when converting token IDs back to strings (i.e. generation output), "▁" should be converted to a real space " ". However, I haven't seen any logic in run.c that replaces "▁" with " ". Is this excluded due to some special case?
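For reference, this is the behavior I mean, shown with the sentencepiece Python API (the model path here is illustrative):

```python
import sentencepiece as spm

# Any Llama-style sentencepiece model; the path is just an example.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

ids = sp.encode("Hello world")
print([sp.id_to_piece(i) for i in ids])  # e.g. ['▁Hello', '▁world'] - pieces keep the '▁'
print(sp.decode(ids))                    # 'Hello world' - decode turns '▁' back into spaces
```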

Yeah, the tokenizer for Llama 2 specifically seems to be quite inconsistent in how it handles spaces. My friend wrote a blog post about his issues with it here: https://davidquarel.github.io/2024/10/01/tokenizer-bad.html

Maybe you are running into similar issues?

My understanding is that the 15M model uses the same tokenizer as the original Llama 2 model family, but I haven't confirmed this.
