Tokenizer doesn't load with transformers 4.34.4

#21
by imdatta0 - opened

As mentioned in the model card, transformers==4.34.4 (edit: transformers==4.43.4) doesn't seem to work when loading the tokenizer. It works fine on transformers==4.45.0. The underlying tokenizers versions are 0.19.1 and 0.20.3 respectively. The tokenizer throws the following error:

{
    "name": "Exception",
    "message": "data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3",
    "stack": "---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[1], line 2
      1 from transformers import AutoTokenizer
----> 2 tokenizer = AutoTokenizer.from_pretrained(\"/mnt/model_pvc/models/Llama-3.3-70B-Instruct\")

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py:896, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    892     if tokenizer_class is None:
    893         raise ValueError(
    894             f\"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported.\"
    895         )
--> 896     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    898 # Otherwise we have to be creative.
    899 # if model is an encoder decoder, the encoder tokenizer class is used by default
    900 if isinstance(config, EncoderDecoderConfig):

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2291, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2288     else:
   2289         logger.info(f\"loading file {file_path} from cache at {resolved_vocab_files[file_id]}\")
-> 2291 return cls._from_pretrained(
   2292     resolved_vocab_files,
   2293     pretrained_model_name_or_path,
   2294     init_configuration,
   2295     *init_inputs,
   2296     token=token,
   2297     cache_dir=cache_dir,
   2298     local_files_only=local_files_only,
   2299     _commit_hash=commit_hash,
   2300     _is_local=is_local,
   2301     trust_remote_code=trust_remote_code,
   2302     **kwargs,
   2303 )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2525, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2523 # Instantiate the tokenizer.
   2524 try:
-> 2525     tokenizer = cls(*init_inputs, **init_kwargs)
   2526 except OSError:
   2527     raise OSError(
   2528         \"Unable to load vocabulary from file. \"
   2529         \"Please check that the provided vocabulary is accessible and not corrupted.\"
   2530     )

File ~/.venvs/pyenv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py:115, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    112     fast_tokenizer = copy.deepcopy(tokenizer_object)
    113 elif fast_tokenizer_file is not None and not from_slow:
    114     # We have a serialization from tokenizers which let us directly build the backend
--> 115     fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    116 elif slow_tokenizer is not None:
    117     # We need to convert a slow tokenizer to build the backend
    118     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: data did not match any variant of untagged enum ModelWrapper at line 1251003 column 3"
}

There seem to be similar issues with other models across repos, e.g. 1, 2.
Is it because some config is missing, or is it a version mismatch? If it's the latter, can we please mention in the model card that 4.45.0 is necessary?
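Until the version requirement is documented, a minimal sketch of a pre-flight check one could run before loading the tokenizer. The `4.45.0` minimum comes from this thread; `parse_version`, `meets_minimum`, and `load_llama_tokenizer` are illustrative helpers, not transformers APIs:

```python
# Sketch: guard against loading the tokenizer on an unsupported transformers
# version. Assumption from this thread: transformers >= 4.45.0 is required.

def parse_version(version: str) -> tuple:
    """Turn a dotted version string like '4.43.4' into an int tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

def meets_minimum(installed: str, minimum: str = "4.45.0") -> bool:
    """Check whether the installed version satisfies the minimum."""
    return parse_version(installed) >= parse_version(minimum)

def load_llama_tokenizer(path: str):
    """Load the tokenizer only after the version check passes."""
    import transformers  # assumed installed when this helper is called
    if not meets_minimum(transformers.__version__):
        raise RuntimeError(
            f"transformers {transformers.__version__} is too old for this "
            "tokenizer; please upgrade to 4.45.0 or later"
        )
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(path)
```

This fails with a clear message instead of the opaque `ModelWrapper` deserialization error from the traceback above.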

Meta Llama org

Hi @imdatta0 - Thanks for opening the issue. This is expected on older versions of transformers due to an update in the tokenizer serialization. That's why we mention using 4.43 and above: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/README.md?code=true#L85

Where did you come across 4.34.4? I'd be happy to fix that.

Hey @reach-vb, sorry, I made a couple of typos in the issue. The issue happens on 4.43.4 (not 4.34.4 as originally mentioned), and doesn't happen on 4.45.0. I have tried a few versions in between and they don't work either.

Meta Llama org

Thanks for the info @imdatta0 - let me patch that

Meta Llama org

Documentation has been updated; it now states that only 4.45.0 and later are supported.

vontimitta changed discussion status to closed