Llama.cpp hybrid layer quantization of gemma-3-4b-it by Google
Original model: https://huggingface.co/google/gemma-3-4b-it
The hybrid quant employs different quantization levels on a per-layer basis to increase flexibility in trading off performance vs. file size. Fewer bits per parameter are used at the deep layers and more bits at the cortex layers, simultaneously optimizing quantized size and model performance. This quant was designed to approximately match Q6_K size with improved performance, while using all K-quants for faster CPU processing when partially offloaded. For this file the layer quants are as follows:
```
LAYER_TYPES='[
   [0 ,"Q8_0" ],[1 ,"Q6_K" ],[2 ,"Q5_K_M"],[3 ,"Q5_K_S"],[4 ,"Q5_K_M"],[5 ,"Q5_K_M"],
   [6 ,"Q5_K_M"],[7 ,"Q5_K_M"],[8 ,"Q6_K" ],[9 ,"Q5_K_M"],[10,"Q6_K" ],[11,"Q5_K_M"],
   [12,"Q6_K" ],[13,"Q5_K_M"],[14,"Q6_K" ],[15,"Q5_K_M"],[16,"Q6_K" ],[17,"Q5_K_M"],
   [18,"Q6_K" ],[19,"Q6_K" ],[20,"Q6_K" ],[21,"Q6_K" ],[22,"Q6_K" ],[23,"Q6_K" ],
   [24,"Q8_0" ],[25,"Q6_K" ],[26,"Q8_0" ],[27,"Q6_K" ],[28,"Q8_0" ],[29,"Q6_K" ],
   [30,"Q8_0" ],[31,"Q6_K" ],[32,"Q8_0" ],[33,"Q8_0" ]
]'
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K"
```
This quant was optimized for high reasoning + knowledge performance across a range of test prompts.
Comparison:
Quant | Size (bytes) | PPL | Comment |
---|---|---|---|
Q6_K | 3.2e9 | 14.3 | default embed and output |
Q6_K_H | 3.2e9 | 14.5 | Q6_K embed, Q6_K output |
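PPL figures like those above can in principle be reproduced with the stock llama-perplexity tool; the test corpus used for this table is not stated here, so the file name below is only a placeholder:

```
# Measure perplexity of the quantized model over a raw text file.
# The corpus behind the table above is not specified; wiki.test.raw
# is a common choice and serves as a placeholder.
./llama-perplexity -m gemma-3-4b-it.Q6_K_H.gguf -f wiki.test.raw -c 2048
```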
Usage:
Gemma-3 4B is a vision-capable model. It can be used together with its multimodal projector layers to process image and text inputs and generate text outputs. The mmproj file is made available in this repository. To test vision mode, follow the docs in the mtmd README in the tools directory of the llama.cpp source tree: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md .
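A minimal vision-mode sanity check might look like the following (flags per the mtmd README; image path and prompt are illustrative):

```
# One-shot image + text prompt using the multimodal projector.
./llama-mtmd-cli -m gemma-3-4b-it.Q6_K_H.gguf \
    --mmproj gemma-3-4b-it.mmproj.gguf \
    --image ./test.jpg \
    -p "Describe this image in detail."
```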
The model also uses sliding window attention (SWA). llama.cpp b5554 or later is recommended for support of the SWA mode. If the --swa-full flag is used, the old method of keeping all KV memory and masking out everything outside the SWA window is used instead. With SWA enabled, prompt cache capability is lost but the available context is greatly increased (around 5.5x bigger).
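As a sketch, the trade-off can be chosen at launch time; the context size below is illustrative:

```
# Default: SWA KV cache enabled -- much larger usable context for the
# same memory, but prompt caching is lost.
./llama-server -m gemma-3-4b-it.Q6_K_H.gguf -c 32768

# With --swa-full: keep the full KV cache and mask outside the window,
# preserving prompt caching at the cost of KV memory.
./llama-server -m gemma-3-4b-it.Q6_K_H.gguf -c 32768 --swa-full
```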
Benchmarks:
A full set of benchmarks for both text and vision mode will eventually be provided here: https://huggingface.co/spaces/steampunque/benchlm
Download the files from the table below:
Link | Type | Size (bytes) | Notes |
---|---|---|---|
gemma-3-4b-it.Q6_K_H.gguf | Q6_K_H | 3.2e9 | ~Q6_K size |
gemma-3-4b-it.mmproj.gguf | mmproj | 0.85e9 | multimodal projector |
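The files can also be fetched from the command line; a sketch using huggingface-cli, with the repo id left as a placeholder:

```
# Fetch the quantized model and its multimodal projector.
# Replace <repo-id> with this repository's id on Hugging Face.
huggingface-cli download <repo-id> gemma-3-4b-it.Q6_K_H.gguf --local-dir .
huggingface-cli download <repo-id> gemma-3-4b-it.mmproj.gguf --local-dir .
```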
A discussion thread about the hybrid layer quant approach can be found on the llama.cpp git repository.