Wahoo thanks for sharing your work!

#1
by ubergarm - opened

Hey good to see you again! Great job cooking and releasing more ik quants! I appreciate the example commands and graphs with some benchmarks, very nice and thoughtful! This looks like a nice quant and folks have been asking me for a bigger one for the 512-768GB class rigs.

If I had two things to suggest, not as criticism but as my wishlist haha:

  1. I've been encouraging folks to add ik_llama.cpp to the model card tags at the top (I add it to mine too) to help people find ik's quants, e.g.
👈 huggingface readme modelcard tags
---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-0528
license: mit
base_model_relation: quantized
tags:
- mla
- imatrix
- conversational
- ik_llama.cpp
---
  2. I know it takes forever, but I'd be curious to compare the final PPL rather than a graph of the first few blocks, e.g. Final estimate: PPL = 3.2688 +/- 0.01739. I'd love to compare some of my quants to this one, and having the same methodology would make that easier. I hope to release some comparisons, and you could then use them to compare your own as well! No pressure, I know it takes a long time to let it finish haha...
👈 specific perplexity methodology
# i grabbed wiki.test.raw from ik's github link, he also has a huggingface repo with test files too

# here is the exact file i use
$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ du -h wiki.test.raw
1.3M    wiki.test.raw
$ sha256sum wiki.test.raw
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08  wiki.test.raw

# with your specific numa/threads/offloading etc, just let it run to finish
$ numactl -N 0,1,2 --interleave=0,1,2 \
    ./build/bin/llama-perplexity \
        --model "$model" \
        -mla 3 -fa \
        -amb 512 \
        -rtr \
        -fmoe \
        -f wiki.test.raw \
        --seed 1337 \
        --threads 128 \
        --numa numactl \
        2>&1 | tee -a $logfile

.
.
.

Final estimate: PPL = 3.2688 +/- 0.01739

Thanks, I was looking for the huggingface docs on how to add all the extra metadata to the model, and the DeepSeek-R1 info was a little outdated/misleading.

I'll dig through the logs and publish the exact perplexity numbers.

Yeah I've noticed that the huggingface model card sidebar doesn't work for some ik quants e.g. this one at least exists but has blank entries for the ik quants: https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF?show_file_info=IQ2_K_R4%2FDeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf

You can get it yourself with ./gguf-py/scripts/gguf_dump.py, but it depends on whether all the quant type codes have been implemented in ik's Python, as that typically lags behind the cpp. I believe some folks did update it fairly recently. Let me know if you get that working to view gguf metadata.
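If you'd rather poke at the metadata programmatically, here is a minimal sketch using the gguf Python package that ships with llama.cpp's gguf-py (the model path is just a placeholder, and as noted above it may still fail on quant type codes the Python side doesn't know about yet):

# minimal sketch: list GGUF key/value metadata and tensor info with the gguf python package
# (pip install gguf, or use the gguf-py bundled with llama.cpp); the path is a placeholder
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf")

# key/value metadata: general.name, general.quantized_by, tokenizer settings, etc.
for key in reader.fields:
    print(key)

# tensor names, shapes, and quant types (this is the part that needs the quant type enums)
for tensor in reader.tensors:
    print(tensor.name, list(tensor.shape), tensor.tensor_type.name)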

EDIT: ooh, you mean the readme.md modelcard tags, yes, I didn't read the docs and just hacked on it a bit until it worked okay lol...

I still need to figure out how to edit the metadata to add in some extra fields like who made it, the URL etc, but that is more "nice to have" and I've skipped it every time just to rush the quant out the door haha...

Yeah, the ik_llama quants like _R4 and _R8 are not supported by vanilla gguf-dump. I had to manually patch the Python code in the gguf-dump venv to get it working with these quants. Maybe worth posting the gguf-dump output as well, since it's not easily accessible.

Thank you for sharing this quant.
It looks exactly like something I would want to use. Do you have any idea how to benchmark this quant vs ubergarm's quant of equivalent size for coding tasks?
Would an aiderbench be useful?
I'd be very interested even in something informal otherwise.
Thx!

Glad you found it useful! I made this quant to use myself, if ubergarm had a quant of similar size, I'd be using that instead of crowding the space with more of the same.

Something like aiderbench could be useful, but I don't put much trust in formal benchmarks, because models tend to get over-optimized to perform well on them, and sometimes benchmark gains don't transfer to real-world applications. The other thing not captured well by formal benchmarks is how badly models blunder when they fail: are they a little bit off, or do they produce complete gibberish? If it's measured as the best attempt out of 3, it doesn't matter for the score. However, in practical applications with multiple steps, these extreme failures can get the model stuck permanently, where it can't dig itself out of the pit.

In terms of informal benchmarking, I have an agent that attempts to implement a 3D game over several iterations. I run it many times and evaluate how far it gets before going off the rails. Smaller quants consistently go off the rails much sooner than larger quants. A simpler version of this is to ask the model to implement the spinning hexagon benchmark and see how many attempts it takes to produce working code that meets all the requirements without additional fixes. In the extreme case, Q1 quants never produce anything that runs, and Q2 quants produce code that runs but most of the time doesn't meet the requirements. Larger quants can usually solve this, but then it becomes a question of how many times they need to try. You can also evaluate how badly they blunder: is it a catastrophic failure that would break a chain of dependent tasks, or a minor import issue that is easily fixed?
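To make the "count the attempts" idea concrete, here is a rough sketch against a local ik_llama.cpp llama-server via its OpenAI-compatible endpoint (the URL, prompt, and the crude "does it run" check are all placeholders; a real harness would sandbox the code and verify the actual requirements):

# rough sketch: keep asking for the spinning-hexagon program until the generated code at least runs
# assumes llama-server is listening on localhost:8080 with the OpenAI-compatible API
import re, subprocess, tempfile, requests

PROMPT = ("Write a Python program showing a ball bouncing inside a spinning hexagon. "
          "The ball should be affected by gravity and friction and bounce realistically "
          "off the rotating walls.")

def ask_model() -> str:
    r = requests.post("http://localhost:8080/v1/chat/completions",
                      json={"messages": [{"role": "user", "content": PROMPT}],
                            "temperature": 0.5},
                      timeout=3600)
    return r.json()["choices"][0]["message"]["content"]

def runs_ok(reply: str) -> bool:
    m = re.search(r"```(?:python)?\s*(.*?)```", reply, re.S)
    if not m:
        return False                      # no code block at all counts as a failure
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(m.group(1))
    try:
        # crude pass/fail: does it start and survive 30 seconds without crashing?
        return subprocess.run(["python3", f.name], timeout=30).returncode == 0
    except subprocess.TimeoutExpired:
        return True                       # still running after 30s: at least it didn't blow up

for attempt in range(1, 11):
    if runs_ok(ask_model()):
        print(f"got runnable code on attempt {attempt}")
        break
else:
    print("no runnable code in 10 attempts")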

Another interesting observation I've had with DeepSeek quants is that the chain of thought tends to come in shorter paragraphs for smaller quants. When you go from Q2 to Q4 you can see the length difference in each "thought" as it scrolls in your terminal, because newlines usually separate distinct "thoughts". Larger quants will have coherent, longer "thoughts" that probe deeper into the problem domain, while smaller quants will have shallow, surface-level "thoughts" that are less useful.

Hope this helps!

I'd be curious to see some kind of aiderbench too, but for simplicity I use the built-in llama-perplexity to measure perplexity on wiki.test.raw, as well as KLD on a personal unpublished novel text test corpus.

Interestingly, @anikifoss chose not to use an imatrix, and I'm very curious whether it affects the numbers much or not. I've since released a larger quant using IQ5_KS and IQ4_KS, which increase speed a bit at the cost of quality. I believe this repo's DQ4_K_R4 should be about 413.2 GiB adding up the GBs for each file, which is still the most chonky ik quant published in terms of pure BPW psure, given it is using IQ6_K for ffn_down and IQ4_K for (gate|up). My biggest one, the IQ4_KS_R4, is 4.701 BPW (368 GiB) now.

perplexity.png

Huh, thanks for sharing the gguf-dump and perplexity values of your quant as well. Interestingly, despite using the same wiki.test.raw, the values do not look comparable to mine as seen in the above chart.

If I normalize each set against its own baseline with something like np.log(quant/base), then perhaps I can compare them.

# np.log is natural log ln()
>>> import numpy as np

# values from anikifoss
>>> base=3.5184
>>> qs = [3.5184, 3.5308, 3.5415, 3.8099, 3.9535]
>>> [np.log(q/base)*100 for q in qs]
# base, DQ4_K_R4, Q4_K_R4, DQ2_K_R4, Q2_K_R4
[0.0, 0.35181333455809527, 0.6544025393180614, 7.959660125663863, 11.659492171048505]

# values from ubergarm quants
>>> base=3.2199
>>> qs = [3.2199, 3.2286, 3.2730,3.5069, 4.8831]
>>> [np.log(q/base)*100 for q in qs]
# base, IQ4_KS_R4, IQ3_K_R4, IQ2_K_R4, IQ1_S_R4
[0.0, 0.26983035678405687, 1.6356692345592185, 8.538215317827454, 41.64299609099737]

# lower is better. keep in mind this was wiki.test.raw english and not coding stuff or other languages etc.

Sorry, it's hard to make sense of without a graph haha... My guess is my numbers are lower for similar-sized quants because I used imatrix, but in terms of "does it vibe code better" I honestly couldn't tell you haha... Or it could just be that anikifoss's numbers were scaled larger for some reason and so they aren't actually comparable.
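If it helps, here is a quick matplotlib sketch of those same normalized numbers side by side (same values as above; just a throwaway bar chart, not the graph from the model card):

# quick bar chart of the ln(PPL/PPL_base)*100 deltas computed above (lower is better)
import numpy as np
import matplotlib.pyplot as plt

aniki = {"Q8_0": 3.5184, "DQ4_K_R4": 3.5308, "Q4_K_R4": 3.5415,
         "DQ2_K_R4": 3.8099, "Q2_K_R4": 3.9535}
uber  = {"Q8_0": 3.2199, "IQ4_KS_R4": 3.2286, "IQ3_K_R4": 3.2730,
         "IQ2_K_R4": 3.5069, "IQ1_S_R4": 4.8831}

for who, ppls in (("anikifoss", aniki), ("ubergarm", uber)):
    base = ppls["Q8_0"]
    deltas = [np.log(v / base) * 100 for v in ppls.values()]
    plt.bar([f"{who}\n{name}" for name in ppls], deltas, alpha=0.6, label=who)

plt.ylabel("ln(PPL / PPL_Q8_0) * 100")
plt.legend()
plt.tight_layout()
plt.savefig("normalized-ppl.png")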

Anyway, exciting to have more high quality ik quants to choose from! Thanks again for publishing and sharing all the details!

The Q8_0 baseline should not be affected by quantization issues or imatrix. I couldn't find the exact source for my perplexity sample; I assumed it was the same as posted in ik_llama discussion, but maybe not.

I assumed it was the same as posted in ik_llama discussion

Yeah, I've posted my methodology but it is buried in various folds and I can never find it myself lol (and it's slightly changed). I thought you used the same, though I don't see the reference on the model card now; I thought I had seen it... anyway, here is what I'm doing currently:

$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ du -h wiki.test.raw
1.3M    wiki.test.raw
$ sha256sum wiki.test.raw
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08  wiki.test.raw

$ ./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_K_R4.gguf \
    -f wiki.test.raw \
    --seed 1337 \
    --ctx-size 512 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 99 \
    -ot "blk\.(3|4|5|6|7|8)\.ffn_.*=CUDA0" \
    -ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
    --override-tensor exps=CPU \
    --threads 24

Final estimate: PPL = 3.2730 +/- 0.01738

Yeah, I just removed the reference from the model card to avoid confusion (I copy pasted the link from ik_llama assuming it was the same)

Yeah, I just removed the reference from the model card to avoid confusion (I copy pasted the link from ik_llama assuming it was the same)

Oooh, I see, you just deleted the reference because the file you used was different from the one I used, if I understand correctly. That would explain why we have different values then. Thanks for clarifying!

Isn't the problem caused by the same data being used for both the quantization optimization (imatrix) and the test (perplexity)?
How hard would it be to compute perplexity scores on (a subset of) The Stack or any other code dataset?
I think it would give us an idea of the impact of the imatrix on code generation, if any.
What do you think?

What makes LLMs valuable for coding is not just the knowledge of a particular programming language, but also knowledge about all the problem domains. In other words, when you're trying to automate something, it helps to have an encyclopedic knowledge about the task you're automating. For small quants, imatrix is good at improving specific benchmarks at the expense of general knowledge. If you have a narrow problem domain (like game development) and a fixed language, like Python, then imatrix could be useful to produce a model that is good at game development in Python while being very compact. The approach I've taken is the opposite: a generalist model that is large and in charge.

Isn't the problem caused by the same data being used for both the quantization optimization (imatrix) and the test (perplexity)?

Heya! I purposely do not use wiki.test.raw or wiki test in general in my imatrix corpus to avoid potentially overfitting it as it is a common benchmark corpus (that I also use). For this same reason I do my KLD calculations on a private corpus of "novel" text that likely has not been used for training or imatrix fitting etc. Mine includes a variety of text, code, maths, and various languages to hopefully not over-fit any single domain, but who knows really!

How hard would it be to compute perplexity scores on (a subset of) The Stack or any other code dataset?

You can run any utf8 text file as the corpus if you prepare it; just replace wiki.test.raw in my commands with your file, yes.
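For example, something like this little sketch would glue a directory of source files into a single utf8 corpus you could pass with -f (the directory and extensions are placeholders, use whatever code you care about):

# sketch: concatenate a code dataset into one utf8 file for llama-perplexity -f
from pathlib import Path

src_dir = Path("my-code-dataset")                # placeholder: any checkout or dataset dump
exts = {".py", ".rs", ".c", ".cpp", ".h"}        # placeholder: languages you care about

with open("code.test.raw", "w", encoding="utf-8") as out:
    for path in sorted(src_dir.rglob("*")):
        if path.is_file() and path.suffix in exts:
            out.write(path.read_text(encoding="utf-8", errors="ignore"))
            out.write("\n\n")

Then just swap code.test.raw in for wiki.test.raw in the llama-perplexity command above.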

I think it would give us an idea of the impact of the IMatrix on code generation, if any.
What do you think ?

And as @anikifoss says:

The approach I've taken is the opposite: a generalist model that is large and in charge.

Yeah, at some point part of this is a bit of a dark art lmao... Check out this old llama.cpp discussion on this very topic from over a year ago haha... Unsloth is now supposedly using some 12k-context, synthetic, per-model-architecture imatrix corpus with specific tokenizations and such. Bartowski has been mixing his up, and especially had to add more tokens for Qwen3-30B-A3B, as the experts can be quite sparse and need a variety of data to even activate.

With aniki's quant being so big, imatrix probably wouldn't help much, as imatrix is supposedly only really helpful under 4 BPW or so.

We can take measurements of PPL and KLD, but they are quite sensitive to the exact parameters and corpus used, which can make it hard to compare "apples to apples". I find it useful for comparing a collection of quants of the same model, all made with the same imatrix and in the same way, at least. I do sometimes compare across quants of the same model, but it becomes more difficult to say much beyond that.

Anyway, it's fun stuff for sure! I learned a lot today, and thanks for the tip on that pesky attn_k_b tensor earlier today too, aniki!

@ubergarm I see you're specifying the seed and the ctx is only 512. Usually I test perplexity with 32k context, so that is likely causing the discrepancy. I will re-run the tests overnight using the same seed and ctx and post the results in this thread.

@ubergarm here are the perplexity numbers:

>>>>>> Q2_K_R4
Final estimate: PPL = 3.7371 +/- 0.02053
>>>>>> DQ2_K_R4
Final estimate: PPL = 3.5520 +/- 0.01928
>>>>>> Q4_K_R4
Final estimate: PPL = 3.2368 +/- 0.01714
>>>>>> DQ4_K_R4
Final estimate: PPL = 3.2276 +/- 0.01708
>>>>>> Q8_0
Final estimate: PPL = 3.2121 +/- 0.01698
And the full command line (click to expand)
echo ">>>>>> DQ4_K_R4" && \
./build/bin/llama-perplexity \
    --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ4_K_R4/DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf \
    -f /mnt/data/Datasets/wiki.test.raw \
    --no-mmap \
    -ctk f16 \
    -mla 3 -fa \
    -amb 1024 \
    -fmoe \
    --seed 1337 \
    --ctx-size 512 \
    -b 2048 -ub 2048 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU,attn_kv_b=CPU \
    --parallel 1 \
    --threads 32 && \
echo ">>>>>> Q2_K_R4" && \
./build/bin/llama-perplexity \
    --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-Q2_K_R4/DeepSeek-R1-0528-Q2_K_R4.gguf \
    -f /mnt/data/Datasets/wiki.test.raw \
    --no-mmap \
    -ctk f16 \
    -mla 3 -fa \
    -amb 1024 \
    -fmoe \
    --seed 1337 \
    --ctx-size 512 \
    -b 2048 -ub 2048 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU,attn_kv_b=CPU \
    --parallel 1 \
    --threads 32 && \
echo ">>>>>> DQ2_K_R4" && \
./build/bin/llama-perplexity \
    --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-DQ2_K_R4/DeepSeek-R1-0528-DQ2_K_R4.gguf \
    -f /mnt/data/Datasets/wiki.test.raw \
    --no-mmap \
    -ctk f16 \
    -mla 3 -fa \
    -amb 1024 \
    -fmoe \
    --seed 1337 \
    --ctx-size 512 \
    -b 2048 -ub 2048 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU,attn_kv_b=CPU \
    --parallel 1 \
    --threads 32 && \
echo ">>>>>> Q4_K_R4" && \
./build/bin/llama-perplexity \
    --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-Q4_K_R4/DeepSeek-R1-0528-Q4_K_R4.gguf \
    -f /mnt/data/Datasets/wiki.test.raw \
    --no-mmap \
    -ctk f16 \
    -mla 3 -fa \
    -amb 1024 \
    -fmoe \
    --seed 1337 \
    --ctx-size 512 \
    -b 2048 -ub 2048 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU,attn_kv_b=CPU \
    --parallel 1 \
    --threads 32 && \
echo ">>>>>> Q8_0" && \
./build/bin/llama-perplexity \
    --model /mnt/data/Models/anikifoss/DeepSeek-R1-0528-Q8_0/DeepSeek-R1-0528-Q8_0.gguf \
    -f /mnt/data/Datasets/wiki.test.raw \
    --no-mmap \
    -ctk f16 \
    -mla 3 -fa \
    -amb 1024 \
    -fmoe \
    --seed 1337 \
    --ctx-size 512 \
    -b 2048 -ub 2048 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU,attn_kv_b=CPU \
    --parallel 1 \
    --threads 32

OK. I'm downloading this quant to compare it with DeepSeek-R1-0528-IQ4_KS_R4. I'll see how both fare on my programming tasks!

@BernardH that's awesome! Please keep us posted: what language/domain you applied them to and the results.

@anikifoss

here are the perplexity numbers

I think your quant "wins" with the best reported perplexity, at least among quants I've seen published with this methodology! Very nice job!

Wow, thanks so much for being thorough and including the commands and everything. Your Q8_0 seems to be very close to what mine was, so our methodologies likely align sufficiently to compare, but of course this is just wiki.test.raw, so I don't want to generalize too much.

I don't know your DQ4_K_R4's exact size in GiB and BPW (I use the numbers printed in the debug logs of llama-server, grep'ing for BPW), but I did a rough estimate using the file sizes reported by huggingface to add yours to this graph:

ppl-r1-0528-dq4_k_r4-ubergarm.png
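For reference, the rough estimate is just total file size converted to bits divided by the parameter count; a quick sketch assuming DeepSeek-R1's roughly 671B parameters and the ~413.2 GiB total from the repo file sizes:

# rough BPW estimate from published file sizes (not the exact figure llama-server logs)
GiB = 1024 ** 3
total_bytes = 413.2 * GiB      # sum of the DQ4_K_R4 split files on huggingface
n_params = 671e9               # approximate DeepSeek-R1 total parameter count

bpw = total_bytes * 8 / n_params
print(f"~{bpw:.2f} BPW")       # comes out around 5.3 BPW with these numbers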

Not sure if you are interested in doing another one, but if you went with all iq5_ks for down/gate/up it would likely be faster without sacrificing much, if any, quality, I'm guessing. It would end up a similar size or possibly even bigger (I haven't calculated exactly). The iq4/5_ks quants tend to be faster psure, although I'm not fully sure on this model given a lot of CPU offload.

No pressure at all, just day dreaming about what other possible options would work in the larger size range you seem to enjoy!

Cheers!

See the differences between -ctk f16 and -ctk q8_0 here, so that could explain these scores.

Yeah, -ctk q8_0 usually has a slightly "worse" PPL than full -ctk f16, similar to what your linked data is showing. That matches my experience here, as I ran my baseline pure Q8_0 quant both ways at some point:

  • Q8_0 -ctk f16: 3.2119 +/- 0.01697
  • Q8_0 -ctk q8_0: 3.2130 +/- 0.01698

It's not too bad though, and I do use q8_0 a lot if I want the extra VRAM for something else like context or offloading another layer.

@BernardH that's awesome! Please keep us posted: what language/domain you applied them to and the results.

I must have a corrupted file: "Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-5): found nan for i1 = 0, i2 = 0, i3 = 0. ne00 = 256" :'(
Would you mind posting hashes (md5sum? whatever) for the 10 files so that I know which one to download again? (Of course I didn't use xet because it was so slooooow last time I tried.)
Thx!

A quick way to get all the checksums is to git-clone the huggingface repo without the LFS plugin. Then each large file is simply a placeholder text file with some metadata, including sha256.
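As a small sketch of that, after a pointer-only clone (e.g. GIT_LFS_SKIP_SMUDGE=1 git clone, or simply cloning without git-lfs installed) each pointer file contains an "oid sha256:" line you can pull out like this (the local path is a placeholder):

# sketch: read the sha256 out of the git-lfs pointer files in a pointer-only clone
# prints in the same "<hash> *<file>" format as sha256sum, like the list below
from pathlib import Path

repo = Path("DeepSeek-R1-0528-DQ4_K_R4")         # placeholder: local clone of the HF repo
for pointer in sorted(repo.rglob("*.gguf")):
    for line in pointer.read_text().splitlines():
        if line.startswith("oid sha256:"):
            print(line.split(":", 1)[1], "*" + pointer.name)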

I also run sha256 manually to make sure the original source matches the repo (it does).

a532a14dffe840f8c6a394417b3109f3f863323230bf76faba8ba1a11de43e79 *DeepSeek-R1-0528-DQ4_K_R4-00001-of-00010.gguf
1f1acfdd50e4dfc2b0544bb568d42abc1e6b89fa93932e30e8a0093356e6f930 *DeepSeek-R1-0528-DQ4_K_R4-00002-of-00010.gguf
0a7a7746d28e4ba7a570209e4e311614716b6668480239ba9630b34c8252a7b5 *DeepSeek-R1-0528-DQ4_K_R4-00003-of-00010.gguf
edd9a617b2b123c2c9c21e2a591959c2f888d38df40a126c13a047ff1bef8d8c *DeepSeek-R1-0528-DQ4_K_R4-00004-of-00010.gguf
adeec3be8993aec156841c31a2f15d936b74add28d85fdfdbd3bfaa3c90daf0e *DeepSeek-R1-0528-DQ4_K_R4-00005-of-00010.gguf
888a7b26a560ed35641304cf75e6e6045cc2887b450c18d3fffcf28ba78a95c9 *DeepSeek-R1-0528-DQ4_K_R4-00006-of-00010.gguf
98bbb26be716a8f9df5365bdbc733339c9b6cf982093885362e39c1ca608ef65 *DeepSeek-R1-0528-DQ4_K_R4-00007-of-00010.gguf
b9a3eca73b3b2986e5dc69791aaa473b0659ee71a35370ef3d902622f9fe9396 *DeepSeek-R1-0528-DQ4_K_R4-00008-of-00010.gguf
7f40cfa6dd62f2c45fdee939120705debf15fd9a108ec76b3840a58d4515a5ec *DeepSeek-R1-0528-DQ4_K_R4-00009-of-00010.gguf
5579f17934f1df9c53dcc9870b80ede5d2fc0af9e03a7c394ae322bf7570afac *DeepSeek-R1-0528-DQ4_K_R4-00010-of-00010.gguf

Thx!
Running a sweep bench right now.
I noticed that your example command contains the following sampling parameter values: «--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0». Any reason to have these instead of the recommended «--temp 0.6 --top-p 0.95» from https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF?

Those are just my preferences, which I found work best for coding. From my experience, min_p is more predictable, since it avoids highly random tokens altogether, so the model stays on track for longer tasks. Take a look at this deep dive into how min_p works. However, this is largely a matter of preference, so what works for you will likely be different.
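For anyone curious what min_p is doing under the hood, here is a toy sketch of the filtering rule (keep only tokens whose probability is at least min_p times the top token's probability) on a made-up distribution:

# toy illustration of min_p filtering on a made-up next-token distribution
import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.015, 0.005])
min_p = 0.1

keep = probs >= min_p * probs.max()   # threshold scales with the most likely token
filtered = np.where(keep, probs, 0.0)
filtered /= filtered.sum()            # renormalize over the survivors

print(keep)       # [ True  True  True  True  True False False False]
print(filtered)   # the long tail of low-probability tokens is gone entirely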

@anikifoss

This ik_llama.cpp PR533 just opened and could boost your quants' PP quite a bit, especially for larger batch sizes! Spread the good word! haha
