---
license: gemma
metrics:
  - perplexity
base_model:
  - google/gemma-3-1b-it-qat-q4_0-gguf
  - bartowski/google_gemma-3-1b-it-GGUF
---

This is a "self" merge of https://huggingface.co/google/gemma-3-1b-it-qat-q4_0-gguf.

The official QAT weights released by Google use fp16 (instead of Q6_K) for the embeddings table, which makes the model take significantly more memory (and storage) than a Q4_0 quant is supposed to. Requantizing that table with llama.cpp fixes the size issue. Instead of quantizing the table myself, though, I extracted it from Bartowski's quantized models, because those were already calibrated with an imatrix, which should squeeze some extra performance out of it and gives better results than a plain requantization.
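
The fp16-vs-Q6_K difference is easy to check directly. Below is a small inspection sketch using llama.cpp's `gguf` Python package (`pip install gguf`); the filenames are placeholders for whichever local copies of the source models you have, not files from this repo.

```python
# Sketch: compare how the token-embeddings tensor is stored in the two
# source GGUFs. Filenames below are placeholders, not files from this repo.
from gguf import GGUFReader

for path in (
    "gemma-3-1b-it-qat-q4_0.gguf",     # placeholder: official Google QAT release
    "google_gemma-3-1b-it-Q4_0.gguf",  # placeholder: Bartowski's imatrix quant
):
    for tensor in GGUFReader(path).tensors:
        if tensor.name == "token_embd.weight":
            # Expect F16 for the official QAT file and a quantized type
            # (e.g. Q6_K) for the imatrix-calibrated quant.
            print(path, tensor.tensor_type.name, f"{tensor.n_bytes / 1e6:.0f} MB")
```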

Here are some perplexity measurements:

| Model | File size ↓ | PPL (wiki.test.raw) ↓ |
|---|---|---|
| This model | 720 MB | 28.0468 +/- 0.26681 |
| This model (older version) | 720 MB | 28.2603 +/- 0.26947 |
| Q4_0 (bartowski) | 722 MB | 34.4906 +/- 0.34539 |
| QAT Q4_0 (google) | 1 GB | 28.0400 +/- 0.26669 |
| BF16 (upscaled to f32 for faster inference) | 2 GB | 29.1129 +/- 0.28170 |
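
Perplexity figures like these are typically produced with llama.cpp's `llama-perplexity` tool on the wikitext-2 raw test split (`wiki.test.raw`); a minimal sketch, with placeholder paths for the binary, model, and dataset:

```python
# Sketch: run llama.cpp's llama-perplexity on the wikitext-2 test split.
# All paths are placeholders for your local setup.
import subprocess

subprocess.run(
    [
        "./llama-perplexity",
        "-m", "gemma-3-1b-it-q4_0-small.gguf",  # placeholder model path
        "-f", "wiki.test.raw",                  # wikitext-2 raw test split
    ],
    check=True,
)
```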

Note that this model ends up slightly smaller than the Q4_0 from Bartowski. This is because llama.cpp switches some tensors to Q4_1 when quantizing a model to Q4_0 with an imatrix, whereas this is a static quant, so those tensors stay at Q4_0.
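
If you want to verify this, the per-tensor quantization types can be tallied with the same `gguf` package as above; a sketch with a placeholder filename:

```python
# Sketch: count the quantization type of every tensor in a GGUF file.
# The filename is a placeholder; point it at Bartowski's imatrix Q4_0 quant
# to see the Q4_1 tensors, or at this model to confirm the static Q4_0 layout.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("google_gemma-3-1b-it-Q4_0.gguf")  # placeholder path
print(Counter(tensor.tensor_type.name for tensor in reader.tensors))
```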