Llama.cpp hybrid layer quantization of Qwen3-8B by Alibaba

Original model: https://huggingface.co/Qwen/Qwen3-8B

The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant achieves a ~6G GGUF with the same perplexity as a ~6.7G Q6_K GGUF. The quants employed are all K quants to avoid slow CPU processing of IQ quants. This matters because extending to the full 128k context length on a consumer grade GPU such as a 12G VRAM card requires partially offloading layers to the CPU, so efficient CPU computation is needed. The quants range from Q4_K_M in the early layers up to Q8_0 in the final (cortex) layers. Because this quant is designed to match Q6_K performance, it is called Q6_K_H. Note there is no unique Q6_K_H quant, since the selection of quantization level as a function of layer is arbitrary. For this file the layer quants are as follows:

embed  : Q6_K
0..5   : Q4_K_M
6..11  : Q5_K_S
12..29 : Q5_K_M
30..32 : Q6_K
33..35 : Q8_0
output : Q6_K

These quants were selected based on combined subjective and objective performance evaluations to give both high performance and moderate file size.
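
As a rough sketch of the partial-offload scenario described above (the layer split, context length, and prompt file here are illustrative assumptions, not tuned values), a long-context run on a 12G VRAM card could look like:

# Partial offload of Qwen3-8B.Q6_K_H.gguf for long context on a 12G VRAM GPU.
# -ngl 28 keeps 28 of the 36 transformer layers on the GPU (assumed value, adjust to fill VRAM);
# the remaining layers run on the CPU, which is why the all-K quant layout matters for speed.
llama-cli -m Qwen3-8B.Q6_K_H.gguf \
  -c 65536 \
  -ngl 28 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -f prompt.txt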

A second, smaller Q4_K_H quant is available, targeting use cases that need large context on smaller VRAM GPUs. This quant fits in a 12G VRAM GPU with enough space for a 95000 token q8 KV cache with full GPU offload. The layer quant distribution was optimized to maintain strong reasoning. With think mode on or off it can correctly solve the large prompt discussed in https://huggingface.co/Qwen/Qwen3-32B/discussions/18 (file https://thireus.com/REDDIT/Qwen3_Runescape_Massive_Prompt.txt), manually edited back to 85k tokens at https://huggingface.co/steampunque/Qwen3-8B-GGUF/blob/main/Qwen3_Runescape_Massive_Prompt_85k.txt so it fits in the 95k of context space available with full offload.

Update: further testing showed that correct solution of the large prompt is quite fragile. Changing the rope scale factor even a little can move the model from solving it correctly to failing. There is some ambiguity in the Qwen3 models about context size: the model card says 32k base or 128k with YaRN, but config.json shows 40k base. The base context affects the rope scale and long context performance. I found correct performance by arbitrarily setting the base to 35840 (35k); then, with a context of 95104 tokens, the rope scale = 95104 / 35840 = 2.65357. On model start pass --rope-scaling yarn --yarn-orig-ctx 35840 --rope-scale 2.65357 (the scale must be adjusted if the KV cache size is other than 95104 tokens).
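
Putting this together, a sketch of a full-offload launch of the Q4_K_H quant with the 95104 token q8 KV cache (only the rope/YaRN flags and values come from the note above; the rest of the command line is an assumed typical invocation):

# Full GPU offload of Qwen3-8B.Q4_K_H.gguf with a 95104 token q8 KV cache.
# rope scale = 95104 / 35840 = 2.65357 as derived above; recompute it if -c is changed.
llama-cli -m Qwen3-8B.Q4_K_H.gguf \
  -c 95104 -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --rope-scaling yarn --yarn-orig-ctx 35840 --rope-scale 2.65357 \
  -f Qwen3_Runescape_Massive_Prompt_85k.txt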

For this file the layer quants are as follows:

embed  : Q4_K
0..5   : alt Q3_K_M Q3_K_S
6..17  : Q3_K_M
18..23 : alt Q3_K_L Q3_K_M
24..29 : alt Q4_K_S Q3_K_L
30..35 : Q4_K_S Q4_K_S Q4_K_M Q5_K_S Q5_K_M Q6_K
output : Q6_K

Comparison:

Quant    Size (bytes)   PPL     Comment
IQ4_XS   4.59e9         10.13   default embed and output
Q4_K_H   4.49e9         10.19   Q4_K embed, Q6_K output
Q6_K     6.7e9          9.92    default embedding and output
Q6_K_H   6.05e9         9.96    hybrid layer quant, Q6_K embedding, Q6_K output
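
Perplexity figures like those above are typically measured with llama.cpp's llama-perplexity tool. A minimal sketch (the test corpus shown is an assumption; this card does not state which text was used):

# Measure perplexity of a quant; lower PPL is better.
# wiki.test.raw is a placeholder corpus, not necessarily the text used for the table above.
llama-perplexity -m Qwen3-8B.Q6_K_H.gguf \
  -f wiki.test.raw \
  -ngl 99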

Download the files below:

Link                   Type     Size (bytes)   Notes
Qwen3-8B.Q4_K_H.gguf   Q4_K_H   4.49e9         ~IQ4_XS size
Qwen3-8B.Q6_K_H.gguf   Q6_K_H   6.05e9         Q6_K quality
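
To fetch a single quant file from the repo with the Hugging Face CLI (as an alternative to the links above):

# Download one GGUF from the steampunque/Qwen3-8B-Hybrid-GGUF repo into the current directory.
huggingface-cli download steampunque/Qwen3-8B-Hybrid-GGUF Qwen3-8B.Q6_K_H.gguf --local-dir .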

A discussion thread about the hybrid layer quant approach can be found on the llama.cpp GitHub repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
