MoE Expert Key Naming Mismatch in Unsloth Dynamic 4-bit Checkpoint
When attempting to load this model with FastLanguageModel.from_pretrained, I get no GPU VRAM usage and a slow, steady climb of system CPU RAM until an OOM error. I have tried many different user options, none of which make a difference. Stepping through the code in a debugger, I can see that Hugging Face’s loader finds a massive discrepancy between the keys the meta-model expects and the keys in the checkpoint shards. As a result, all MoE expert parameters are marked missing or unexpected, leading to CPU offload of hundreds of tensors and eventual OOM.
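For reference, a minimal repro sketch of the call that triggers this (the repo id is a placeholder for the dynamic 4-bit checkpoint I am loading, and the options shown are just the ones I have been varying):

```python
from unsloth import FastLanguageModel

# Placeholder repo id: substitute the actual Unsloth dynamic 4-bit MoE checkpoint.
MODEL_NAME = "<unsloth-dynamic-4bit-moe-checkpoint>"

# This call never allocates GPU VRAM; system CPU RAM grows until OOM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=4096,
    dtype=None,          # auto-detect
    load_in_4bit=True,   # checkpoint is already BitsAndBytes NF4-quantized
)
```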
Two concrete mismatches:
- The meta-model’s state_dict keys include an explicit expert index, e.g.:
  language_model.model.layers.0.feed_forward.experts.0.down_proj.weight
  language_model.model.layers.0.feed_forward.experts.0.gate_proj.weight
- The dynamic checkpoint names omit the index and add BitsAndBytes suffixes, e.g.:
  language_model.model.layers.0.feed_forward.experts.down_proj.weight.quant_state.bitsandbytes__nf4
  language_model.model.layers.0.feed_forward.experts.down_proj.weight.absmax
Hugging Face’s _find_missing_and_unexpected_keys() does a strict set difference between the expected and checkpoint key sets, so every indexed expert parameter lands in missing_keys and every un-indexed checkpoint tensor in unexpected_keys. The loader then materializes the "missing" parameters on CPU instead of the GPU, and system RAM climbs until OOM.
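The mismatch is easy to confirm outside the loader. Here is a rough diagnostic sketch (the repo id is a placeholder, the suffix list is my best guess at the BitsAndBytes metadata keys, and I assume the standard sharded model.safetensors.index.json is present); it reads the checkpoint key names and checks them against an expected key, mirroring the comparison the loader performs:

```python
import json
import re

from huggingface_hub import hf_hub_download

REPO_ID = "<unsloth-dynamic-4bit-moe-checkpoint>"  # placeholder

# Keys stored in the checkpoint shards, taken from the safetensors index.
index_path = hf_hub_download(REPO_ID, "model.safetensors.index.json")
with open(index_path) as f:
    checkpoint_keys = set(json.load(f)["weight_map"])

# Strip the BitsAndBytes metadata suffixes added by the quantizer (assumed list).
BNB_SUFFIX = re.compile(r"\.(quant_state\..+|absmax|quant_map|nested_absmax|nested_quant_map)$")
stripped = {BNB_SUFFIX.sub("", k) for k in checkpoint_keys}

# What the meta-model expects vs. what the shards actually contain.
expected_example = "language_model.model.layers.0.feed_forward.experts.0.down_proj.weight"
print(expected_example in stripped)  # False: the index is missing from the checkpoint name
print("language_model.model.layers.0.feed_forward.experts.down_proj.weight" in stripped)  # True
```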
Versions:
(huggingface-hub-0.34.3, safetensors-0.6.1, tokenizers-0.21.4, tqdm-4.67.1, transformers-4.55.0)
(accelerate-1.9.0 aiohappyeyeballs-2.6.1 aiohttp-3.12.15 aiosignal-1.4.0 async-timeout-5.0.1 bitsandbytes-0.46.1 cut_cross_entropy-25.1.1 datasets-3.6.0 diffusers-0.34.0 dill-0.3.8 docstring-parser-0.17.0 frozenlist-1.7.0 fsspec-2025.3.0 hf_transfer-0.1.9 markdown-it-py-3.0.0 mdurl-0.1.2 mpmath-1.3.0 msgspec-0.19.0 multidict-6.6.3 multiprocess-0.70.16 networkx-3.4.2 peft-0.17.0 propcache-0.3.2 pyarrow-21.0.0 rich-14.1.0 sentencepiece-0.2.0 shtab-1.7.2 sympy-1.14.0 torch-2.7.1 torchvision-0.22.1 triton-windows-3.4.0.post20 trl-0.21.0 typeguard-4.4.4 typing-extensions-4.14.1 tyro-0.9.27 unsloth-2025.8.1 unsloth_zoo-2025.8.1 xformers-0.0.31.post1 xxhash-3.5.0 yarl-1.20.1)
Relevant code locations:
- HF core: _load_state_dict_into_meta_model in modeling_utils.py (line ~743)
- HF core: _find_missing_and_unexpected_keys in modeling_utils.py (line ~1511); look at expected_keys, checkpoint_keys, and the resulting missing_keys and unexpected_keys
- HF core: model._move_missing_keys_from_meta_to_cpu(missing_keys + mismatched_keys, unexpected_keys, dtype, hf_quantizer) in _load_pretrained_model in modeling_utils.py (line ~5389)
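To make the failure mode concrete, here is a simplified illustration of that set logic (not the actual transformers code; the keys are just the examples from above):

```python
# Simplified illustration of the strict set difference performed by
# _find_missing_and_unexpected_keys (not the actual transformers implementation).
expected_keys = {
    "language_model.model.layers.0.feed_forward.experts.0.down_proj.weight",
    "language_model.model.layers.0.feed_forward.experts.0.gate_proj.weight",
}
checkpoint_keys = {
    "language_model.model.layers.0.feed_forward.experts.down_proj.weight",
}

missing_keys = expected_keys - checkpoint_keys      # every indexed expert parameter
unexpected_keys = checkpoint_keys - expected_keys   # every un-indexed checkpoint tensor

# _load_pretrained_model then calls
# model._move_missing_keys_from_meta_to_cpu(missing_keys + mismatched_keys, ...),
# which materializes all of these on CPU; this is where system RAM climbs.
print(sorted(missing_keys), sorted(unexpected_keys))
```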
I believe the component responsible for this alignment is Unsloth, which needs to synchronize its expert naming conventions and metadata suffixes with those expected by the meta-model. Ideally, this would involve upstream normalization of the expert index and bitsandbytes suffixes within Unsloth’s dynamic 4-bit loader/publisher. I have also documented this issue at https://github.com/unslothai/unsloth/issues/3105
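As a rough illustration of the normalization I have in mind (purely a sketch, not a proposed patch; the helper name and the suffix list are my own assumptions), canonicalizing both sides by stripping the BitsAndBytes metadata suffixes and the per-expert index makes the two naming schemes line up, which suggests the discrepancy is a naming/packing convention rather than genuinely absent weights:

```python
import re

# Hypothetical helper: canonicalize a parameter name by stripping the
# BitsAndBytes quant-state suffixes and the per-expert index, so that
# meta-model keys and dynamic-checkpoint keys can be compared directly.
_BNB_SUFFIX = re.compile(
    r"\.(quant_state\.bitsandbytes__\w+|absmax|quant_map|nested_absmax|nested_quant_map)$"
)
_EXPERT_IDX = re.compile(r"(\.experts)\.\d+(\.)")

def canonical(key: str) -> str:
    key = _BNB_SUFFIX.sub("", key)
    key = _EXPERT_IDX.sub(r"\1\2", key)
    return key

expected = "language_model.model.layers.0.feed_forward.experts.0.down_proj.weight"
checkpoint = "language_model.model.layers.0.feed_forward.experts.down_proj.weight.quant_state.bitsandbytes__nf4"

print(canonical(expected) == canonical(checkpoint))  # True
```

Of course, if the checkpoint really stores the experts fused into a single tensor per projection, a pure rename is not enough and the shapes would also need to be reconciled, which is why I think the fix belongs in Unsloth's dynamic 4-bit loader/publisher rather than in user code.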