671B params vs 685B params?
Hi, it's probably a stupid question, but I'm wondering why this model has 671B parameters while the original model has 685B. Was the number of parameters reduced during the conversion, or is this based on the older V3 model (DeepSeek increased from 671B to 685B, no?), or something else?
Thanks for the fast answer!
Found out the same question was asked at https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/discussions/8 — it seems the difference is some additional weights that GGUFs don't include.
oh nice thanks for the link, that makes sense!
When you look at the infamous Perplexity copy of R1, it also has a different number of parameters than the original R1 model, yet it's supposedly the same model, only "uncensored". And those are both safetensors versions with different parameter counts, so I don't think the theory about GGUFs not including all the layers is correct.
As far as I know, it's about the ~14B of Multi-Token Prediction (MTP) module weights that the new V3 model has. These are not necessarily supported by the inference software, so they get dropped during conversion.
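The arithmetic checks out, by the way. A quick sketch (the ~14B MTP figure is taken from this thread, not measured from the checkpoints):

```python
# Parameter counts as reported on the model cards (approximate)
total_with_mtp = 685_000_000_000   # full safetensors release, incl. MTP module
mtp_module = 14_000_000_000        # Multi-Token Prediction weights (per this thread)

# What remains after conversion drops the MTP weights
converted = total_with_mtp - mtp_module
print(f"{converted / 1e9:.0f}B")  # matches the 671B reported for the GGUF
```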