DeepSeek-R1-AWQ quantized model missing one layer of experts
Hi everyone,
I downloaded the quantized DeepSeek-R1-AWQ model from Hugging Face, and upon inspecting the model.safetensors.index.json, I noticed something curious: the expert layers start at layer 3, matching the original DeepSeek-R1, but they stop at layer 60. The original model clearly has experts all the way up to layer 61.
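For reference, this is roughly how I checked (a quick script; the local path is just an example, and I'm assuming the routed experts follow the usual model.layers.N.mlp.experts.* naming):

```python
import json
import re

# Index file shipped with the AWQ checkpoint (example local path).
with open("DeepSeek-R1-AWQ/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Collect the layer indices that contain routed-expert tensors.
expert_layers = set()
for name in weight_map:
    m = re.match(r"model\.layers\.(\d+)\.mlp\.experts\.", name)
    if m:
        expert_layers.add(int(m.group(1)))

print(f"expert layers: {min(expert_layers)}..{max(expert_layers)}")
# AWQ checkpoint: 3..60; the original DeepSeek-R1 index goes up to 61.
```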
Does anyone know why the last expert layer (layer 61) is missing in this quantized version? Is this intentional, and what exactly was removed? Is there any quantized model that can be run with SGLang/vLLM on 8x80GB H100 GPUs while retaining all the expert layers?
Thanks in advance for any insights!
The last layer is used for speculative decoding (MTP). Since AutoAWQ doesn't support quantizing the MTP head yet, it was excluded during quantization. The model should work as intended, just a little bit slower.
thanks @v2ray!
It looks like layer #61 in DeepSeek-R1 contains both a standard MoE expert block (with the same structure as layers #3-60) and an additional MTP-specific block. The MTP block appears to make up the final part of the layer and includes:
"model.layers.N.embed_tokens.weight",
"model.layers.N.enorm.weight",
"model.layers.N.hnorm.weight",
"model.layers.N.eh_proj.weight",
"model.layers.N.shared_head.norm.weight",
"model.layers.N.shared_head.head.weight",
Since AutoAWQ does not support quantizing MTP, it removed layer #61 entirely. However, this also means that the MoE expert block in layer #61 was removed along with the MTP block.
I'm using this file as a reference, where you can see that layer 61 also contains experts apart from the MTP weights: https://huggingface.co/deepseek-ai/DeepSeek-R1/raw/main/model.safetensors.index.json
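For example, splitting the layer-61 entries of that index by name shows both groups (a quick sketch; the path assumes the JSON has been downloaded locally, and the MTP markers are taken from the list above):

```python
import json

# Original DeepSeek-R1 index from the URL above, downloaded locally.
with open("DeepSeek-R1/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

layer61 = [n for n in weight_map if n.startswith("model.layers.61.")]

# Tensor names that only appear in the MTP block (see the list above).
mtp_markers = ("embed_tokens", "enorm", "hnorm", "eh_proj", "shared_head")
mtp_only = [n for n in layer61 if any(m in n for m in mtp_markers)]
experts = [n for n in layer61 if ".mlp.experts." in n]

print(f"layer 61 tensors:  {len(layer61)}")
print(f"  MTP-specific:    {len(mtp_only)}")
print(f"  routed experts:  {len(experts)}")
```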
Just to confirm: were the MoE experts in layer #61 actually tied to MTP, or should they have been treated as independent of it (like the experts in layers #3-60)? If they were independent, could this mean that AutoAWQ unintentionally removed standard MoE functionality along with MTP?
The entire final layer is dedicated to MTP, so removing it is OK. If you check their official weight conversion code, you'll see that they discard the final layer too.
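To illustrate the idea (just a sketch of the filtering step, not their actual conversion script; the layer index and file path are taken from this thread):

```python
import json

MTP_LAYER = 61  # final layer, which holds the MTP head (per this thread)

with open("DeepSeek-R1/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

# Keep only main-model tensors; everything under the MTP layer is dropped.
kept = {
    name: shard
    for name, shard in weight_map.items()
    if not name.startswith(f"model.layers.{MTP_LAYER}.")
}
print(f"dropped {len(weight_map) - len(kept)} tensors from layer {MTP_LAYER}")
```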
thanks a lot for clarifying!