Notice: After further investigation into how the layers are stored in tensors, it seems this is currently not possible. It would require rewriting a large part of the llama.cpp code, which would then need to be merged upstream. There was a misunderstanding between how I thought it works and how it actually works. However, this is still an interesting topic to potentially explore further in the future, or with another library. I will not be exploring this any further, for now.
=== Keeping this for archival reasons ===
Qwen3-30B-A3B-By-Expert-Quantization-GGUF (BEQ)
This is an experimental implementation of MoE quantization: the probability that an expert is activated determines the quantization type used for its weights.
Currently, we are using llama.cpp, and there is no implementation of quantizing directly from a Hugging Face model to Q4_K_M, so we use Q4_0 instead; Q4_K_M could provide better results. There is also currently no support for Q3_0, which could be an interesting combination with Q5_0 for the size-to-performance ratio. We use a threshold on the activation probability to determine which quantization type to pick.
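The selection rule described above can be sketched roughly as follows. The function and expert names here are illustrative, not the actual implementation (which is on GitHub); the quant-type strings match the ones used in this card.

```python
# Hypothetical sketch of the threshold rule: experts whose activation
# probability meets the threshold get the larger quant type, the rest get
# the smaller one. Names and the example probabilities are illustrative.

def pick_quant_type(activation_prob: float,
                    threshold: float = 0.285,
                    high: str = "Q5_0",
                    low: str = "Q4_0") -> str:
    """Assign a quantization type to an expert by its activation probability."""
    return high if activation_prob >= threshold else low

# Example: one frequently activated expert, one rarely activated expert.
probs = {"expert_0": 0.41, "expert_7": 0.12}
types = {name: pick_quant_type(p) for name, p in probs.items()}
# expert_0 -> Q5_0, expert_7 -> Q4_0
```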
The implementation can be found on Github.
The activation probabilities for these files are sourced from kalomaze/Qwen3-16B-A3B.
Here is a table of some quick perplexity tests I did on the test split of wikitext-2-raw-v1, using a threshold of 0.285. These are short tests just to check that it works.
Simply ran with defaults as:

```
./build/bin/llama-perplexity -m model_name.gguf -ngl 99 -fa -f wiki.test.raw
```
| Comparison | Perplexity | Size | Size (bytes) |
|---|---|---|---|
| q81_q4 | – | – | – |
| q8_q51 | – | – | – |
| q8_q5 | 9.1580 ± 0.07331 | 21G | – |
| q8_q41 | – | – | – |
| q8_q4 | 9.1346 ± 0.07255 | – | – |
| q51_q41 | 9.4782 ± 0.07698 | 19G | 20059780992 |
| q51_q4 | 9.2974 ± 0.07461 | 18.2G | – |
| q5_q41 | – | – | – |
| q5_q4 | 9.1854 ± 0.07286 | 17G | 18190432128 |
| q5_q4-q8 | 9.1900 ± 0.07289 | 17.6G | 17606997888 |
| Qwen3-30B-A3B-UD-Q4_K_XL | 9.1906 ± 0.07311 | 17G | 17715663712 |
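As a rough intuition for why mixing quant types shrinks the file, one can estimate expert sizes from the standard llama.cpp block layouts (Q4_0 is 18 bytes per 32 weights, i.e. 4.5 bits per weight; Q5_0 is 22 bytes per 32 weights, i.e. 5.5 bpw; Q8_0 is 34 bytes per 32, i.e. 8.5 bpw). The expert counts and parameter count in the example below are illustrative, not measured from this model.

```python
# Rough size estimate for a set of mixed-quant experts. The bits-per-weight
# values are the standard llama.cpp block rates for these types; the expert
# split in the example is hypothetical.
BPW = {"Q4_0": 4.5, "Q5_0": 5.5, "Q8_0": 8.5}

def estimated_bytes(params_per_expert: int, quant_counts: dict) -> int:
    """Expected bytes for experts of `params_per_expert` weights each,
    quantized according to a {quant_type: num_experts} mapping."""
    bits = sum(BPW[q] * n * params_per_expert for q, n in quant_counts.items())
    return int(bits / 8)

# Example: 128 experts of 10M params each, 30 kept at Q5_0, the rest at Q4_0.
size = estimated_bytes(10_000_000, {"Q5_0": 30, "Q4_0": 98})
```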
Naming format: `max-quant_min-quant` (q8 here is the `--outtype` value; if omitted, it was left as auto).
Maybe I'll do some more testing if I get the time. But more optimization could show promising results, as this is a rough first implementation.