Llamafy tokenizer too?

#1
by deepak-banka - opened

Can you convert the tokenizer to Llama format as well?

I am trying to fine-tune this model with Unsloth, but the tokenizer is not supported.

Thank you for the great work.

Getting this error with unsloth:

silence09/InternLM3-8B-Instruct-Converted-LlaMA does not have a padding token! Will use pad_token = .

ValueError Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/convert_slow_tokenizer.py in convert_slow_tokenizer(transformer_tokenizer, from_tiktoken)
1635 additional_special_tokens=transformer_tokenizer.additional_special_tokens,
-> 1636 ).converted()
1637 except Exception:

11 frames
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
/usr/local/lib/python3.11/dist-packages/transformers/convert_slow_tokenizer.py in convert_slow_tokenizer(transformer_tokenizer, from_tiktoken)
1636 ).converted()
1637 except Exception:
-> 1638 raise ValueError(
1639 f"Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path "
1640 f"with a SentencePiece tokenizer.model file."

ValueError: Converting from Tiktoken failed, if a converter for SentencePiece is available, provide a model path with a SentencePiece tokenizer.model file.
Currently available slow->fast convertors: ['AlbertTokenizer', 'BartTokenizer', 'BarthezTokenizer', 'BertTokenizer', 'BigBirdTokenizer', 'BlenderbotTokenizer', 'CamembertTokenizer', 'CLIPTokenizer', 'CodeGenTokenizer', 'ConvBertTokenizer', 'DebertaTokenizer', 'DebertaV2Tokenizer', 'DistilBertTokenizer', 'DPRReaderTokenizer', 'DPRQuestionEncoderTokenizer', 'DPRContextEncoderTokenizer', 'ElectraTokenizer', 'FNetTokenizer', 'FunnelTokenizer', 'GPT2Tokenizer', 'HerbertTokenizer', 'LayoutLMTokenizer', 'LayoutLMv2Tokenizer', 'LayoutLMv3Tokenizer', 'LayoutXLMTokenizer', 'LongformerTokenizer', 'LEDTokenizer', 'LxmertTokenizer', 'MarkupLMTokenizer', 'MBartTokenizer', 'MBart50Tokenizer', 'MPNetTokenizer', 'MobileBertTokenizer', 'MvpTokenizer', 'NllbTokenizer', 'OpenAIGPTTokenizer', 'PegasusTokenizer', 'Qwen2Tokenizer', 'RealmTokenizer', 'ReformerTokenizer', 'RemBertTokenizer', 'RetriBertTokenizer', 'RobertaTokenizer', 'RoFormerTokenizer', 'SeamlessM4TTokenizer', 'SqueezeBertTokenizer', 'T5Tokenizer', 'UdopTokenizer', 'WhisperTokenizer', 'XLMRobertaTokenizer', 'XLNetTokenizer', 'SplinterTokenizer', 'XGLMTokenizer', 'LlamaTokenizer', 'CodeLlamaTokenizer', 'GemmaTokenizer', 'Phi3Tokenizer']

Ignore this.
I figured out the issue: passing fix_tokenizer=False to Unsloth solves it.
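For anyone hitting the same traceback, the workaround above can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: fix_tokenizer is the Unsloth load-time flag mentioned in the fix, while the max_seq_length and load_in_4bit values are purely illustrative.

```python
from unsloth import FastLanguageModel

# Sketch of the workaround: fix_tokenizer=False tells Unsloth not to
# attempt its tokenizer repair, which otherwise triggers the failing
# tiktoken -> fast-tokenizer conversion shown in the traceback above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="silence09/InternLM3-8B-Instruct-Converted-LlaMA",
    max_seq_length=2048,   # illustrative value
    load_in_4bit=True,     # illustrative value
    fix_tokenizer=False,   # the fix: skip the slow->fast conversion
)
```

Running this requires a GPU environment with Unsloth installed and will download the checkpoint on first use.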

deepak-banka changed discussion status to closed
