Notice: After further investigation into how the layers are stored in tensors, it seems this is currently not possible. It would require rewriting a lot of the llama.cpp code, which would then need to be merged, etc. There was a mismatch between how I thought it works and how it actually works. However, this is still an interesting topic to potentially explore further in the future, or with another library. I will not be exploring this any further, for now.

=== Keeping this for archival reasons ===

Qwen3-30B-A3B-By-Expert-Quantization-GGUF (BEQ)

This is an experimental implementation of MoE quantization. We use each expert's activation probability to determine the quantization type for its weights.

Currently, we are using llama.cpp, and there is no implementation for quantizing from an HF model directly to Q4_K_M, so we use Q4_0 instead; Q4_K_M could provide better results. There is also no support for Q3_0, which could be an interesting combination with Q5_0 for size-to-performance ratio. We use a threshold on the activation probability to determine which quantization type to pick, as sketched below.
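For illustration, the core decision is just a per-expert threshold check. Below is a minimal sketch in Python, not the actual llama.cpp changes; the function name, the example activation probabilities, and the default quant types are hypothetical:

```python
# Minimal sketch of the per-expert selection rule, assuming a simple threshold check.
# This is NOT the actual llama.cpp patch; the function name, the dict of activation
# probabilities, and the default quant types are made up for illustration.

THRESHOLD = 0.285  # the threshold used for the tests in the table below


def select_expert_quant(activation_prob: float,
                        high_type: str = "Q5_0",
                        low_type: str = "Q4_0",
                        threshold: float = THRESHOLD) -> str:
    """Pick a quant type for one expert's weights from its activation probability.

    Assumption: experts activated at or above the threshold keep the
    higher-precision type, rarely activated experts get the smaller one.
    """
    return high_type if activation_prob >= threshold else low_type


if __name__ == "__main__":
    # Made-up activation probabilities for a few experts of one MoE layer.
    expert_probs = {0: 0.41, 1: 0.12, 2: 0.30, 3: 0.05}
    for expert_id, prob in expert_probs.items():
        print(f"expert {expert_id}: p={prob:.2f} -> {select_expert_quant(prob)}")
```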

The implementation can be found on GitHub.

The activation probabilities for these files are sourced from kalomaze/Qwen3-16B-A3B.

Here is a table of some quick perplexity tests I did on the test split of wikitext-2-raw-v1, using a threshold of 0.285. These are short tests just to see if it works.

Simply ran with the defaults as:

./build/bin/llama-perplexity -m model_name.gguf -ngl 99 -fa -f wiki.test.raw
| Comparison | Perplexity | Size | Extra Info |
|---|---|---|---|
| q81_q4 | - | - | - |
| q8_q51 | - | - | - |
| q8_q5 | 9.1580 ± 0.07331 | 21G | - |
| q8_q41 | - | - | - |
| q8_q4 | 9.1346 ± 0.07255 | - | - |
| q51_q41 | 9.4782 ± 0.07698 | 19G | 20059780992 |
| q51_q4 | 9.2974 ± 0.07461 | 18.2G | - |
| q5_q41 | - | - | - |
| q5_q4 | 9.1854 ± 0.07286 | 17G | 18190432128 |
| q5_q4-q8 | 9.1900 ± 0.07289 | 17.6G | 17606997888 |
| Qwen3-30B-A3B-UD-Q4_K_XL | 9.1906 ± 0.07311 | 17G | 17715663712 |

Naming format: max-quant_min-quant (q8 here is the --outtype, or auto if left unset).
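Reading the names above under this format (my interpretation, matching the sketch further up): q5_q4 would mean Q5_0 for experts above the threshold and Q4_0 for those below, and names with a trailing 1 (q51, q41) would presumably use the corresponding _1 variants (Q5_1, Q4_1).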

Maybe I'll do some more testing if I get the time. More optimization could show promising results, as this is a janky first implementation.
