KL Divergence as Performance Metric

#5
by joaquinrfs - opened

Perplexity measures how well a model's predictions match the ground-truth text of a dataset. While low-quality quants do show increases in perplexity once they start outputting nonsense, some of a quant's predictions may land closer to the reference dataset than what the original high-precision model would have predicted; this lowers perplexity even though the quant has degraded the model.
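
For reference, perplexity over a token sequence x_1, ..., x_N is just the exponentiated average negative log-likelihood the model assigns to the ground-truth tokens:

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)
$$

so anything that shifts probability mass toward the test text lowers it, whether or not the quantized model still matches the original.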

To correctly measure how a quant altered the model's original predictions, one must compare the quantized model's new logits against the unquantized model's logits on the same dataset. In other words, one must measure the KL divergence.
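
Concretely, at each token position the full-precision model's distribution P and the quantized model's distribution Q (each obtained by softmaxing that model's logits over the vocabulary V) are compared with

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{v \in V} P(v)\,\log\frac{P(v)}{Q(v)}
$$

and llama.cpp then reports summary statistics of this quantity (such as the mean) over all evaluated positions.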

First run the original model on a dataset and save the logits (to logits.dat for example):

llama-perplexity -m <MODEL> -f <DATASET> --kl-divergence-base logits.dat

Then run the quantized model against these saved logits and instruct llama.cpp to calculate the KLD. There is no need to specify the dataset here, since the text is taken from the logits file:

llama-perplexity -m <QUANTIZED_MODEL> --kl-divergence-base logits.dat --kl-divergence
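
For intuition only, here is a minimal sketch of the quantity being computed, assuming you already have the base and quantized logits for the same tokens as plain arrays (the names and shapes are hypothetical; this is not llama.cpp's file format or implementation):

```python
import numpy as np

def mean_kl_divergence(base_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean per-token KL(P_base || Q_quant) from two [n_tokens, n_vocab] logit arrays."""
    def log_softmax(x):
        # Numerically stable log-softmax over the vocabulary axis.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(base_logits)   # full-precision model
    log_q = log_softmax(quant_logits)  # quantized model

    # KL(P || Q) = sum_v P(v) * (log P(v) - log Q(v)), averaged over all tokens.
    kld_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kld_per_token.mean())

# Toy usage with random logits standing in for real model outputs.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 1000))
quant = base + 0.05 * rng.normal(size=(8, 1000))
print(mean_kl_divergence(base, quant))
```

A mean KLD near zero means the quant reproduces the original model's token distributions almost exactly, regardless of what the ground-truth text says.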

If you could provide KLD stats instead of, or alongside, the perplexity ones, it would be greatly appreciated.

This is an age-old quandary: how to measure the degradation of a quantized model relative to the full unquantized weights.

  1. I do provide some KLD stats on the full size version here: https://huggingface.co/ubergarm/GLM-4.5-GGUF#quant-collection
  2. Read this reddit post I wrote: https://www.reddit.com/r/LocalLLaMA/comments/1khwxal/the_great_quant_wars_of_2025/
  3. I have a guide showing how to do KLD stats here: https://github.com/ikawrakow/ik_llama.cpp/discussions/434#discussion-8342773

Generally, perplexity is "good enough" for comparing quants that use the same imatrix corpus with the exact same configuration, hardware, and test corpus. I don't use wiki.test data in my imatrix corpus, to do my best to avoid overfitting, so the perplexity values I provide are sufficient for a rough relative quality comparison between the quantized versions.

Full GLM-4.5 did not "behave": the smaller quants had "better" perplexity than the full bf16, which can be indicative of QAT or other numerical effects, as you mention. That is why I provided some additional KLD stats there. However, Air was fairly well behaved, with perplexity increasing monotonically as BPW decreases, so perplexity should be correlated well enough with quality there.

Generally, grabbing the most BPW you can fit on your hardware while still hitting your target speed goals is about all you can do in terms of quality trade-offs.

If you want to provide the KLD stats yourself, with the full methodology and corpus listed, that would be cool, but no pressure.

Thanks!
