DeepSeek-R1-0528-DQ2_K_R4
Thank you for all this interesting work. Would you be able to upload DeepSeek-R1-0528-DQ2_K_R4 please?
Thank you! I could upload the DeepSeek-R1-0528-DQ2_K_R4 quant, but it's mainly there for performance comparison. I would not recommend using it, because it was made without an imatrix. For quants smaller than Q4, you will get much better results when using an imatrix. Instead, I recommend taking a look at some of the @ubergarm quants here. The IQ2 quant should be much better quality while being comparable in size and performance to my DQ2_K_R4 quant.
Ah, thank you for the clarification. I was under the impression that it was the opposite quality-wise. That makes sense now.
My focus is on the larger models, which should offer better quality even without using an importance matrix (imatrix). I noticed a huge improvement in the quality of generated code when going from Q2 to Q4, so I'm avoiding heavy quantization or any significant alteration to the base model.
Going below Q4, and especially down to Q2, significantly hurts model performance on some tasks, so you really need an imatrix when that much of the model's data is discarded. An importance matrix guides quantization towards preserving the knowledge exercised by the dataset used to create it. When a model is heavily quantized it will inevitably forget things, because so much information is removed. So while an imatrix helps the model retain the knowledge covered by the imatrix dataset, it will also cause it to forget things that dataset does not cover.
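To make that concrete, here is a minimal toy sketch (in Python with NumPy, not llama.cpp's actual quantization code) of the idea behind importance-weighted quantization: the scale for a block of weights is chosen to minimize a *weighted* squared error, so weights the imatrix marks as important survive quantization more faithfully at the expense of the rest. The importance values here are invented purely for illustration.

```python
import numpy as np

def quantize_block(w, importance, bits=2):
    """Round-to-nearest quantization with a brute-force search for the
    block scale that minimizes importance-weighted squared error."""
    levels = 2 ** bits                        # e.g. 4 representable values at 2-bit
    qmin, qmax = -(levels // 2), levels // 2 - 1
    best_scale, best_err = None, np.inf
    for s in np.linspace(np.abs(w).max() / levels, np.abs(w).max(), 64):
        q = np.clip(np.round(w / s), qmin, qmax)
        err = np.sum(importance * (w - q * s) ** 2)   # weighted reconstruction error
        if err < best_err:
            best_scale, best_err = s, err
    q = np.clip(np.round(w / best_scale), qmin, qmax)
    return q * best_scale

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)

uniform = np.ones_like(w)                              # no imatrix: all weights equal
skewed = np.where(rng.random(256) < 0.1, 10.0, 0.1)    # imatrix-like: a few hot weights

for name, imp in (("uniform", uniform), ("skewed", skewed)):
    wq = quantize_block(w, imp)
    print(f"{name:8s} weighted err: {np.sum(imp * (w - wq) ** 2):.3f}")
```

With the skewed importance, the search sacrifices accuracy on the many unimportant weights to protect the few hot ones, which is exactly the "retains what the calibration data exercises, forgets the rest" effect described above.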
I added benchmarks for Q2_K_R4 and DQ2_K_R4 so people can get a sense of how much token generation performance they are giving up for the improved quality of the output. At the same time, the perplexity numbers give a sense of how much the model is degraded by the different quantization strategies, to help you make a more informed decision about which tradeoffs are right for you.
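For anyone unfamiliar with the metric: perplexity is just the exponential of the mean negative log-likelihood the model assigns to the evaluation text, so lower means less quantization damage. A minimal sketch, with made-up log-probabilities standing in for real model output:

```python
import math

# Hypothetical per-token log-probabilities from a model on some eval text.
token_logprobs = [-2.1, -0.3, -1.7, -0.9, -2.4]

nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
ppl = math.exp(nll)
print(f"perplexity = {ppl:.2f}")  # lower = model is less "surprised" by the text
```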
That makes perfect sense. I've been using Q8 models for as long as I could, specifically because I don't want these quality tradeoffs to impact my work. But the cost and the crippled speed (specifically prompt processing) involved in running DeepSeek even at Q4 are quite challenging for most people, myself included. You are definitely right, though: if the calibration data used to produce the imatrix doesn't match your use case, it will impact your work.
I wish we could have, alongside a general massive model, separate smaller models with experts tailored to specific tasks or languages, like what Qwen (and others) used to do with their Coding/Vision/General models. That way, hardware-poor plebs could load and run these "expert" models at blazing speeds and switch between them at will. Providing smaller task-expert models as well as one big MoE model... but that's just my opinion.
All I can run on my current hardware is IQ2 at okay-ish speed and IQ3 at a snail's pace. Q4 would require a whole system upgrade.