Llama.cpp hybrid layer quantization of QwQ 32B by Qwen

Original model: https://huggingface.co/Qwen/QwQ-32B

The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant was optimized for high performance on a set of test prompts at approximately IQ4_XS size. Only K quants are used, to avoid the slow CPU and older-GPU processing associated with IQ quants. For this file the layer quants are as follows (a sketch of how they might be applied follows the listing):

   LAYER_TYPES='[
   [0 ,"Q4_K_M"],[1 ,"Q4_K_S"],[2 ,"Q3_K_L"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
   [8 ,"Q3_K_M"],[9 ,"Q3_K_M"],[10,"Q3_K_M"],[11,"Q3_K_M"],[12,"Q3_K_M"],[13,"Q3_K_M"],[14,"Q3_K_M"],[15,"Q3_K_M"],
   [16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
   [24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q3_K_L"],[29,"Q3_K_L"],[30,"Q3_K_L"],[31,"Q3_K_L"],
   [32,"Q3_K_L"],[33,"Q3_K_L"],[34,"Q3_K_L"],[35,"Q3_K_L"],[36,"Q3_K_L"],[37,"Q3_K_L"],[38,"Q3_K_L"],[39,"Q3_K_L"],
   [40,"Q4_K_S"],[41,"Q3_K_L"],[42,"Q4_K_S"],[43,"Q3_K_L"],[44,"Q4_K_S"],[45,"Q3_K_L"],[46,"Q4_K_S"],[47,"Q3_K_L"],
   [48,"Q4_K_S"],[49,"Q4_K_S"],[50,"Q4_K_S"],[51,"Q4_K_S"],[52,"Q4_K_S"],[53,"Q4_K_S"],[54,"Q4_K_S"],[55,"Q4_K_S"],
   [56,"Q4_K_M"],[57,"Q4_K_S"],[58,"Q4_K_M"],[59,"Q4_K_M"],[60,"Q4_K_M"],[61,"Q5_K_S"],[62,"Q5_K_M"],[63,"Q6_K"  ]
   ]'
   FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K --layer-types-high"

Comparison:

    Quant    Size (bytes)  PPL  Comment
    IQ4_XS   17.9e9        5.7  Q6_K with default embedding and output
    Q4_K_H   17.9e9        5.8  Hybrid quant with Q6_K embedding, Q6_K output
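The perplexity numbers above can be approximately reproduced with the stock llama.cpp perplexity tool; a minimal sketch is shown below. The test corpus file name is a placeholder, and the corpus actually used for the figures above is not specified on this card, so exact values may differ.

    # Sketch: measure perplexity of the hybrid quant with llama-perplexity.
    # The corpus file is a placeholder; the corpus behind the numbers above
    # is not stated here, so results will not match exactly.
    llama-perplexity -m QwQ-32B.Q4_K_H.gguf -f wiki.test.raw -ngl 99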

Usage:

This is an RL-trained thinking model. The layer quants for this model were optimized for a high success rate on a curated set of test/eval prompts. The model exhibits overthinking when solving problems, and this behavior could not be fixed by adjusting the layer quants; the overthinking is baked into the model by its RL training. The model was clearly trained not to just pull a formula out of its latent space, apply it, and call it done, but to reason its way through solutions and form a logical framework for its final presented solution.

The quant was evaluated with greedy sampling and found to be stable on all test set problems, but with ~8k context it often ran out of tokens while clearly on its way to a correct solution due to overthinking. To solve moderate-difficulty problems, a minimum of 8k context should be configured, with 16k recommended. Using 8-bit KV cache instead of F16 KV is recommended to help increase the usable context size.

The model can be fully offloaded to two 4070s over RPC with a ~25 t/s generation rate when using a downstream speculator with the llama.cpp server. Qwen3 0.6B is the recommended speculator for this model. At llama.cpp b6045, universal speculator support was added to upstream llama.cpp, but the upstream speculation functionality was not evaluated in these tests. A sketch of a possible server launch is shown below.
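The following is a minimal sketch of a llama-server launch along the lines described above, not the exact command used for the results on this card: file names, the RPC endpoint, and flag values are illustrative, flag names may differ between llama.cpp versions, and it uses the upstream speculative decoding flags rather than the downstream speculator setup mentioned above.

    # Sketch only: illustrative paths, endpoint, and values; check flag names
    # against your llama.cpp build. Upstream speculative decoding flags are
    # used here, not the downstream speculator setup referenced above.
    llama-server \
        -m QwQ-32B.Q4_K_H.gguf \
        -md Qwen3-0.6B.Q8_0.gguf \
        -c 16384 \
        -ctk q8_0 -ctv q8_0 -fa \
        -ngl 99 \
        --rpc 192.168.1.10:50052
    # -m    : the hybrid quant from this card
    # -md   : draft (speculator) model
    # -c    : 16k context as recommended above
    # -ctk/-ctv : 8-bit KV cache (quantized V cache requires -fa, flash attention)
    # -ngl  : offload all layers to GPU
    # --rpc : example endpoint of an rpc-server running on the second GPU host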

Benchmarks:

A set of math benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm

Download the file from the link below:

    Link                  Type     Size (bytes)   Notes
    QwQ-32B.Q4_K_H.gguf   Q4_K_H   17.9e9         ~IQ4_XS size

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
