Tokenizer doesn't load with transformers 4.34.4

#21
by imdatta0 - opened

As mentioned in the model card, transformers==4.34.4 (edit: transformers==4.43.4) doesn't seem to work when loading the tokenizer. It works fine on transformers==4.45.0. The underlying tokenizers versions are 0.19.1 and 0.20.3 respectively. The tokenizer throws the following error:

{
    "name": "Exception",
    "message": "data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3",
    "stack": "---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[1], line 2
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained(\"/mnt/model_pvc/models/Llama-3.3-70B-Instruct\")

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:896, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    892     if tokenizer_class is None:
    893         raise ValueError(
    894             f\"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported.\"
    895         )
--> 896     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    898 # Otherwise we have to be creative.
    899 # if model is an encoder decoder, the encoder tokenizer class is used by default
    900 if isinstance(config, EncoderDecoderConfig):

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2291, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2288     else:
   2289         logger.info(f\"loading file {file_path} from cache at {resolved_vocab_files[file_id]}\")
-> 2291 return cls._from_pretrained(
   2292     resolved_vocab_files,
   2293     pretrained_model_name_or_path,
   2294     init_configuration,
   2295     *init_inputs,
   2296     token=token,
   2297     cache_dir=cache_dir,
   2298     local_files_only=local_files_only,
   2299     _commit_hash=commit_hash,
   2300     _is_local=is_local,
   2301     trust_remote_code=trust_remote_code,
   2302     **kwargs,
   2303 )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2525, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2523 # Instantiate the tokenizer.
   2524 try:
-> 2525     tokenizer = cls(*init_inputs, **init_kwargs)
   2526 except OSError:
   2527     raise OSError(
   2528         \"Unable to load vocabulary from file. \"
   2529         \"Please check that the provided vocabulary is accessible and not corrupted.\"
   2530     )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py:115, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    112     fast_tokenizer = copy.deepcopy(tokenizer_object)
    113 elif fast_tokenizer_file is not None and not from_slow:
    114     # We have a serialization from tokenizers which let us directly build the backend
--> 115     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    116 elif slow_tokenizer is not None:
    117     # We need to convert a slow tokenizer to build the backend
    118     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3"
}

There seem to be similar issues with other models across repos, e.g. 1, 2.
Is it because some config is missing, or is it a version mismatch? If it's the latter, can we please mention in the model card that 4.45.0 is necessary?
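Until the version requirement is documented, a minimal sketch of a pre-flight check one could run before loading the tokenizer. The `4.45.0` minimum comes from this thread; `parse_version`, `meets_minimum`, and `load_llama_tokenizer` are illustrative helpers, not transformers APIs:

```python
# Sketch: guard against loading the tokenizer on an unsupported transformers
# version. Assumption from this thread: transformers >= 4.45.0 is required.

def parse_version(version: str) -> tuple:
    """Turn a dotted version string like '4.43.4' into an int tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

def meets_minimum(installed: str, minimum: str = "4.45.0") -> bool:
    """Check whether the installed version satisfies the minimum."""
    return parse_version(installed) >= parse_version(minimum)

def load_llama_tokenizer(path: str):
    """Load the tokenizer only after the version check passes."""
    import transformers  # assumed installed when this helper is called
    if not meets_minimum(transformers.__version__):
        raise RuntimeError(
            f"transformers {transformers.__version__} is too old for this "
            "tokenizer; please upgrade to 4.45.0 or later"
        )
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(path)
```

This fails with a clear message instead of the opaque `ModelWrapper` deserialization error from the traceback above.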

Meta Llama org

Hi @imdatta0 - Thanks for opening the issue. This is expected on older versions of transformers due to an update in the tokenizer serialization. That's why we mention using 4.43 and above: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/README.md?code=true#L85

Where did you come across 4.34.4? I'd be happy to fix that.

Hey @reach-vb, sorry, I made a couple of typos in the issue. The issue happens on 4.43.4 (not 4.34.4 as originally mentioned), and doesn't happen on 4.45.0. I have tried a few versions in between and they don't work either.

Meta Llama org

Thanks for the info @imdatta0 - let me patch that

Meta Llama org

Documentation has been updated; it now states that only 4.45.0 and later are supported.

vontimitta changed discussion status to closed