Notice: After further investigation into how the layers are stored in tensors, it seems this is currently not possible. It would require rewriting a large part of the llama.cpp code, which would then need to be merged upstream. There was a misunderstanding between how I thought it works and how it actually works. However, this is still an interesting topic to potentially explore further in the future, or with another library. I will not be exploring this any further, for now.
=== Keeping this for archival reasons ===
Qwen3-30B-A3B-By-Expert-Quantization-GGUF (BEQ)
This is an experimental implementation of MoE quantization: the probability that an expert is activated determines the quantization type used for its weights.
Currently, we are using llama.cpp, and there is no implementation of quantizing directly from a Hugging Face model to Q4_K_M, so we use Q4_0 instead; Q4_K_M could provide better results. There is also currently no support for Q3_0, which could be an interesting combination with Q5_0 for the size-to-performance ratio. We use a threshold on the activation probability to determine which quantization type to pick.
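The selection rule described above can be sketched roughly as follows. The function and expert names here are illustrative, not the actual implementation (which is on GitHub); the quant-type strings match the ones used in this card.

```python
# Hypothetical sketch of the threshold rule: experts whose activation
# probability meets the threshold get the larger quant type, the rest get
# the smaller one. Names and the example probabilities are illustrative.

def pick_quant_type(activation_prob: float,
                    threshold: float = 0.285,
                    high: str = "Q5_0",
                    low: str = "Q4_0") -> str:
    """Assign a quantization type to an expert by its activation probability."""
    return high if activation_prob >= threshold else low

# Example: one frequently activated expert, one rarely activated expert.
probs = {"expert_0": 0.41, "expert_7": 0.12}
types = {name: pick_quant_type(p) for name, p in probs.items()}
# expert_0 -> Q5_0, expert_7 -> Q4_0
```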
The implementation can be found on Github.
The activation probabilities for these files are sourced from kalomaze/Qwen3-16B-A3B.
Here is a table of some quick perplexity tests I did on the test split of wikitext-2-raw-v1, using a threshold of 0.285. These are short tests just to check that it works.
Simply ran with defaults as:

```
./build/bin/llama-perplexity -m model_name.gguf -ngl 99 -fa -f wiki.test.raw
```
| Comparison | Perplexity | Size | Size (bytes) |
|---|---|---|---|
| q81_q4 | – | – | – |
| q8_q51 | – | – | – |
| q8_q5 | 9.1580 ± 0.07331 | 21G | – |
| q8_q41 | – | – | – |
| q8_q4 | 9.1346 ± 0.07255 | – | – |
| q51_q41 | 9.4782 ± 0.07698 | 19G | 20059780992 |
| q51_q4 | 9.2974 ± 0.07461 | 18.2G | – |
| q5_q41 | – | – | – |
| q5_q4 | 9.1854 ± 0.07286 | 17G | 18190432128 |
| q5_q4-q8 | 9.1900 ± 0.07289 | 17.6G | 17606997888 |
| Qwen3-30B-A3B-UD-Q4_K_XL | 9.1906 ± 0.07311 | 17G | 17715663712 |
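As a rough intuition for why mixing quant types shrinks the file, one can estimate expert sizes from the standard llama.cpp block layouts (Q4_0 is 18 bytes per 32 weights, i.e. 4.5 bits per weight; Q5_0 is 22 bytes per 32 weights, i.e. 5.5 bpw; Q8_0 is 34 bytes per 32, i.e. 8.5 bpw). The expert counts and parameter count in the example below are illustrative, not measured from this model.

```python
# Rough size estimate for a set of mixed-quant experts. The bits-per-weight
# values are the standard llama.cpp block rates for these types; the expert
# split in the example is hypothetical.
BPW = {"Q4_0": 4.5, "Q5_0": 5.5, "Q8_0": 8.5}

def estimated_bytes(params_per_expert: int, quant_counts: dict) -> int:
    """Expected bytes for experts of `params_per_expert` weights each,
    quantized according to a {quant_type: num_experts} mapping."""
    bits = sum(BPW[q] * n * params_per_expert for q, n in quant_counts.items())
    return int(bits / 8)

# Example: 128 experts of 10M params each, 30 kept at Q5_0, the rest at Q4_0.
size = estimated_bytes(10_000_000, {"Q5_0": 30, "Q4_0": 98})
```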
Naming format: `max-quant_min-quant` (q8 here is the `--outtype` value; if omitted, it was left as auto).
Maybe I'll do some more testing if I get the time. But more optimization could show promising results, as this is a rough first implementation.