671B params vs 685B params?
Hi, it's probably a stupid question, but I'm wondering why this model has 671B parameters while the original model has 685B. Was the number of parameters reduced during the conversion, or is this based on the older V3 model (DeepSeek increased from 671B to 685B, no?), or something else?
Thanks for the fast answer!
Found out the same question was asked at https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF/discussions/8 — it seems the difference is some additional weights that GGUFs don't include.
oh nice thanks for the link, that makes sense!
When you look at the infamous Perplexity copy of R1, it also has a different number of parameters than the original R1 model, yet it's supposedly the same model, only "uncensored". And those are both safetensors versions with different parameter counts, so I don't think the theory about GGUFs not including all the layers is correct.
As far as I know, it's about the ~14B of Multi-Token Prediction (MTP) module weights that the new V3 model has. These are not necessarily supported by the inference software, so they get dropped during conversion.
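The arithmetic checks out, by the way. A quick sketch (the ~14B MTP figure is taken from this thread, not measured from the checkpoints):

```python
# Parameter counts as reported on the model cards (approximate)
total_with_mtp = 685_000_000_000   # full safetensors release, incl. MTP module
mtp_module = 14_000_000_000        # Multi-Token Prediction weights (per this thread)

# What remains after conversion drops the MTP weights
converted = total_with_mtp - mtp_module
print(f"{converted / 1e9:.0f}B")  # matches the 671B reported for the GGUF
```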