671B params vs 685B params?

#3
by masel99 - opened

Hi, it's probably a stupid question, but I'm wondering why this model has 671B parameters while the original model has 685B. Was the number of parameters reduced during the conversion, or is this based on the older V3 model (DeepSeek increased from 671B to 685B, didn't they?), or something else?

That's a great question 🤔 I think it's just an auto-filled value and not read from the model weights themselves, but I could be wrong.

@reach-vb ?

Thanks for the fast answer!
Found the same question asked at https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/discussions/8; it seems the difference is some additional weights the GGUFs don't include.

Oh nice, thanks for the link, that makes sense!

When you look at the infamous Perplexity copy of R1, it also has a different number of parameters than the original R1 model, yet it's supposedly the same model, only "uncensored". The safetensors versions themselves have different parameter counts, so I don't think the theory about GGUFs not including all the layers is the whole explanation.

As far as I know, it's the ~14B of Multi-Token Prediction (MTP) module weights that the new V3 model has. These are not necessarily supported by the inference software.
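For illustration, the arithmetic lines up with that explanation. This is just a sketch using the figures quoted in this thread (685B total, 671B without the MTP module); the exact counts would need to be verified against the actual weight files:

```python
# Parameter counts in billions, as quoted in this discussion
full_model = 685   # safetensors release, reportedly includes the MTP module
gguf_model = 671   # GGUF conversion, reportedly without the MTP weights

mtp_module = full_model - gguf_model
print(f"Implied MTP module size: ~{mtp_module}B params")  # ~14B
```

So the GGUF isn't a smaller or older model per se; it just omits a module the converter/inference stack doesn't use.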
