the notes on each quant
hey sorry, i couldn't find this anywhere: if for example you have
Q6_K 6.7 very good quality
Q8_0 8.6 fast, best quality
does that mean the Q8_0 is faster than Q6_K?
just asking because the Q6 is already pushing my system, but i might be able to run Q8 if it's faster.
Q8_0 is easier for your CPU to decode and can be faster, but other factors, such as your memory bandwidth, can influence it. You'll have to try.
Unless your CPU is low-end/laptop/phone, you are likely memory bandwidth bottlenecked. So usually, the smaller the quant, the faster you can run it. I did a lot of performance measurements of all our quants on different CPUs, which you can obtain from http://www.nicobosshard.ch/perfData.zip
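The bandwidth argument can be put in numbers: when generation is memory-bound, every token requires streaming roughly the whole model from RAM, so the quant file size sets a throughput ceiling. A back-of-envelope sketch (the bandwidth figure and model sizes below are illustrative assumptions, not measurements):

```python
# Rough upper bound on tokens/sec for a memory-bandwidth-bound CPU:
# each generated token streams (roughly) the whole model from RAM,
# so throughput <= bandwidth / model_size.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Back-of-envelope ceiling, ignoring compute and cache effects."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical numbers: dual-channel DDR4 desktop, sizes from the quant table
bandwidth = 50.0    # GB/s, assumed
q6_k_size = 6.7     # GB
q8_0_size = 8.6     # GB

print(f"Q6_K ceiling: {max_tokens_per_sec(bandwidth, q6_k_size):.1f} tok/s")
print(f"Q8_0 ceiling: {max_tokens_per_sec(bandwidth, q8_0_size):.1f} tok/s")
```

With these assumed numbers the smaller quant's ceiling is about 28% higher, which is why Q8_0 being "faster" only holds when something other than bandwidth is the bottleneck.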
I would be very surprised if on your system Q8 is faster than Q6_K. I personally always use Q5_K_M for the best performance/quality/memory trade-off.
Is the performance data measurement finished? (i.e. ready for the model page, not that I would have time for that right now)?
Yes, I completed it a month ago: https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/4#67f5cccc7640911cd446d624
I think I never uploaded the final data, as the above link is slightly dated, but I have it all locally and will upload it later today if I find time to do so.
sure, no hurry, let's just make sure we eventually make good use of it. i plan to have a more complex selection above the quant table, where you can replace the score column with both speed and quality metrics. or maybe have two columns, not sure. yes, the readme needs to be patched still.
okay so i have ollama running on intel imac and so far i can't see much difference between Q5_K_M and Q8_0. what's a good way to test?
You will realistically never see any quality difference between Q5_K_M and Q8_0, as it is way too small for humans to notice. Everyone who claims they can tell the difference probably only thinks they can. The only way to tell is using synthetic measurements like perplexity, KL-divergence, top-token probability and same-token probability. You could run benchmarks like ARC/MMLU/WinoGrande, but even for those the difference will be so small it falls below the measurement error threshold and so is not measurable. If you really want to see a difference, compare with something like i1-IQ2_M, where the difference is quite obvious.
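To make the KL-divergence idea concrete, here is a minimal sketch of what such a measurement compares: the next-token probability distributions of two models over the same context. The distributions below are made-up toy numbers purely for illustration, not actual quant outputs:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P||Q) in nats between two next-token probability distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 4-token vocab (invented numbers):
p_q8  = [0.70, 0.20, 0.07, 0.03]  # stand-in for Q8_0
p_q5  = [0.68, 0.21, 0.08, 0.03]  # stand-in for Q5_K_M, nearly identical
p_iq2 = [0.45, 0.30, 0.15, 0.10]  # stand-in for an aggressive quant like i1-IQ2_M

print(kl_divergence(p_q8, p_q5))   # tiny
print(kl_divergence(p_q8, p_iq2))  # orders of magnitude larger
```

In practice a tool would average this over thousands of tokens from a real text corpus; the point is that the Q5/Q8 gap only shows up in aggregates like this, never in a single response.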
i do think that i can tell the difference which is funny, i haven't done a blind test. also Q5_K_M is faster i think
You can tell the difference based on which one is faster, because Q5_K_M is much faster compared to Q8_0 assuming you are bandwidth bottlenecked, which you will be unless your computer is a potato. I don't think you can tell them apart based on quality. Maybe you think you can because one of them happened to be better due to you randomly getting better seeds. There is no measurable difference when running evals, and the top-token probability is so similar that even someone trained to recognize the difference would likely need a few thousand responses to tell. While the chosen tokens don't always perfectly match, they are on average almost equally good at their predictions. Maybe someone should do a scientific experiment about this, as it is quite an interesting topic. At least I personally can for sure not tell any difference.
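The "few thousand responses" figure can be sanity-checked with a standard power calculation: to detect that one quant wins a blind A/B comparison slightly more than half the time, the required number of paired trials grows with the inverse square of the edge. A rough sketch (normal approximation, 95% confidence; the edge values are assumptions for illustration):

```python
import math

def samples_needed(edge: float, z: float = 1.96) -> int:
    """Approximate paired comparisons needed to detect a win-rate of
    0.5 + edge at ~95% confidence (normal approximation at p=0.5)."""
    return math.ceil((z / (2 * edge)) ** 2)

print(samples_needed(0.05))  # a 55% win rate: a few hundred trials
print(samples_needed(0.01))  # a 51% win rate: nearly ten thousand trials
```

If Q5_K_M vs Q8_0 shifts preferences by a percent or less, even a careful blind test needs thousands of comparisons, which matches why nobody can tell them apart in casual use.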
yeah i definitely "think" i can, but i've never run models with the same parameters or conditions to actually find out. i am really interested because i think there would be more links to actual data than just the math behind it. i want to do one myself but i get bored by the time my computer loads a model, so i definitely won't be doing it any time soon on my machine!!!
unrelated though, i've quantized music before and could tell the difference when beats were quantized up to 0.0625 seconds (about 8% of a note) away from their original values. i don't know how that relates to this haha, or how far away the language model weights are, but if there's definitely a noticeable difference in the lower Qs then there must be a difference in all of them? i don't know, i honestly can't tell, but just thinking that it's better definitely makes for a better experience so i'm happy to lie to myself hahaha.