Llama.cpp hybrid layer quantization of QwQ 32B by Qwen
Original model: https://huggingface.co/Qwen/QwQ-32B
The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant was optimized for high performance across a set of test prompts at approximately IQ4_XS size. The quants employed are all K-quants to avoid slow CPU or older-GPU processing of IQ quants. For this file the layer quants are as follows:
```
LAYER_TYPES='[
[0 ,"Q4_K_M"],[1 ,"Q4_K_S"],[2 ,"Q3_K_L"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
[8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
[16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
[24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
[32,"Q3_K_L"],[33,"Q3_K_L"],[34,"Q3_K_L"],[35,"Q3_K_L"],[36,"Q3_K_L"],[37,"Q3_K_L"],[38,"Q3_K_L"],[39,"Q3_K_L"],
[40,"Q4_K_S"],[41,"Q3_K_L"],[42,"Q4_K_S"],[43,"Q3_K_L"],[44,"Q4_K_S"],[45,"Q3_K_L"],[46,"Q4_K_S"],[47,"Q3_K_L"],
[48,"Q4_K_S"],[49,"Q4_K_S"],[50,"Q4_K_S"],[51,"Q4_K_S"],[52,"Q4_K_S"],[53,"Q4_K_S"],[54,"Q4_K_S"],[55,"Q4_K_S"],
[56,"Q4_K_M"],[57,"Q4_K_S"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q5_K_S"],[62,"Q5_K_M"],[63,"Q6_K" ]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K --layer-types-high"
```
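Assuming a llama-quantize build patched to honor these per-layer overrides (the layer-type override mechanism and the --layer-types-high flag are not part of upstream llama.cpp), the quantization step might look roughly like this; the source filename and the Q4_K_S base type are placeholders:

```bash
# Sketch only, not the author's exact command. Assumes a llama-quantize build
# patched to read LAYER_TYPES from the environment and to honor the
# --layer-types-high flag; neither is in upstream llama.cpp.
# --token-embedding-type and --output-tensor-type are standard options.
export LAYER_TYPES FLAGS
./llama-quantize $FLAGS QwQ-32B.BF16.gguf QwQ-32B.Q4_K_H.gguf Q4_K_S
```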
Comparison:
Quant | Size | PPL | Comment |
---|---|---|---|
IQ4_XS | 17.9e9 | 5.7 | Q6_K with default embedding and output |
Q4_K_H | 17.9e9 | 5.8 | Hybrid quant with Q4_K embedding Q6_K output |
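Perplexity figures of this kind can in principle be reproduced with llama.cpp's llama-perplexity tool; the test corpus and settings below are assumptions, since the exact evaluation setup is not stated here:

```bash
# Illustrative perplexity run; wiki.test.raw, the context size, and -ngl 99
# are assumed settings, not the exact ones used for the table above.
./llama-perplexity -m QwQ-32B.Q4_K_H.gguf -f wiki.test.raw -c 2048 -ngl 99
```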
Usage:
This is an RL-trained thinking model. The layer quants for this model were optimized for high success rate on a curated set of test/eval prompts.

The model exhibits overthinking when solving problems, and this behavior could not be corrected by adjusting layer quants: the overthinking is baked in by its RL training. The model was clearly trained not to just pull a formula out of its latent space, apply it, and call it done, but to reason its way through solutions and build a logical framework for its final presented answer. The quant was evaluated with greedy sampling and found to be stable on all test set problems, but with ~8k context it often ran out of tokens while clearly on its way to a correct solution due to overthinking. To solve moderate-difficulty problems, configure a minimum of 8k context; 16k is recommended. An 8-bit (q8_0) KV cache instead of F16 KV is recommended to help increase usable context size.

The model can be fully offloaded to two 4070s over RPC at a ~25 t/s generation rate when using a downstream speculator with the llama.cpp server; Qwen3 0.6B is the recommended speculator for this model. Universal speculator support was added to upstream llama.cpp at b6045, but that upstream speculation functionality was not evaluated in these tests. An illustrative server invocation is sketched below.
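A minimal sketch of such a launch using upstream llama-server flags follows; the RPC endpoints, draft model filename, and context settings are illustrative assumptions, not the author's exact downstream-speculator configuration:

```bash
# Sketch only: hostnames, ports, filenames, and sizes are assumptions.
# -md loads the draft (speculator) model, --cache-type-k/v q8_0 selects an
# 8-bit KV cache, and --rpc spreads layers across remote RPC backends.
./llama-server \
    -m  QwQ-32B.Q4_K_H.gguf \
    -md Qwen3-0.6B.Q8_0.gguf \
    -c 16384 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    -ngl 99 -ngld 99 \
    --rpc host1:50052,host2:50052 \
    --host 127.0.0.1 --port 8080
```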
Benchmarks:
A set of math benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the file from below:
Link | Type | Size | Notes |
---|---|---|---|
QwQ-32B.Q4_K_H.gguf | Q4_K_H | 17.9e9 B | ~IQ4_XS size |
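For reference, the file can also be fetched with huggingface-cli; the repository id below is a placeholder, since it is not spelled out in this card:

```bash
# Placeholder repo id: substitute the repository that hosts this model card.
huggingface-cli download <user>/<repo> QwQ-32B.Q4_K_H.gguf --local-dir .
```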
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: