---
tags:
- quantization
- kronecker
- second-order
- YAQA
- LLaMA
- Qwen
- efficient
---
⚡ FastKronQuantization
Fast second-order Kronecker-factored quantization for LLMs
🧠 Abstract
Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent YAQA framework (Tseng et al., 2025b), which employs Kronecker-factored approximations of the Hessian via a power-iteration technique, we propose an alternative approach that replaces this step with a more efficient Kronecker decomposition method from Chekalina et al. (2025).
This formulation preserves the benefits of second-order, curvature-aware quantization while substantially reducing computational cost.
We apply our method to LLaMA-2 7B, LLaMA-3 8B Instruct, and Qwen-3 8B Instruct and show that it matches YAQA's post-quantization model quality with significantly less computation: the Kronecker factors required for the target quality are obtained from roughly 10× fewer tokens and with an approximately 10× speedup over the original work.
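Both the sketched estimator in YAQA and the decomposition of Chekalina et al. (2025) rest on the same underlying idea: approximate a layer's curvature matrix by a Kronecker product of two much smaller factors. The snippet below is a minimal, generic illustration of that idea (the classic Van Loan–Pitsianis nearest-Kronecker-product construction), not the estimator used in either paper or in this repository; the sizes, the random test matrix, and the function name are stand-ins.

```python
# Generic illustration (not the estimator from YAQA or Chekalina et al., 2025):
# the best Kronecker approximation H ≈ A ⊗ B in the Frobenius norm reduces to a
# rank-1 approximation of a rearranged matrix (Van Loan–Pitsianis). Sizes and
# the random test matrix are arbitrary stand-ins.
import numpy as np

def nearest_kronecker(H: np.ndarray, p: int, q: int):
    """Return A (p x p) and B (q x q) minimizing ||H - kron(A, B)||_F."""
    assert H.shape == (p * q, p * q)
    # Group H into p x p blocks of size q x q, then flatten each block into a row.
    blocks = H.reshape(p, q, p, q)                  # blocks[i, k, j, l] = H[i*q + k, j*q + l]
    R = blocks.transpose(0, 2, 1, 3).reshape(p * p, q * q)
    # The best rank-1 factor of R gives vec(A) and vec(B).
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(S[0]) * U[:, 0].reshape(p, p)
    B = np.sqrt(S[0]) * Vt[0].reshape(q, q)
    return A, B

p, q = 8, 16
rng = np.random.default_rng(0)
M = rng.standard_normal((p * q, p * q))
H = M @ M.T                                         # symmetric PSD stand-in for a layer Hessian
A, B = nearest_kronecker(H, p, q)
err = np.linalg.norm(H - np.kron(A, B)) / np.linalg.norm(H)
print(f"relative Frobenius error of the Kronecker fit: {err:.3f}")
```

Keeping only the small factors A and B, rather than the full (p·q) × (p·q) matrix, is what makes Kronecker-factored curvature tractable at LLM scale; the two approaches differ in how these factors are estimated from calibration data, which is the step this repository accelerates.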
🧭 Checkpoints

| Model name | Architecture | Bits |
|---|---|---|
| FastKronQuant-LLaMA2-7B-4bit | LLaMA-2-7B | 4-bit |
| FastKronQuant-LLaMA3-8B-4bit | LLaMA-3-8B-Instruct | 4-bit |
| FastKronQuant-Qwen3-8B-4bit | Qwen-3-8B | 4-bit |
| FastKronQuant-LLaMA2-7B-2bit | LLaMA-2-7B | 2-bit |
| FastKronQuant-LLaMA3-8B-2bit | LLaMA-3-8B-Instruct | 2-bit |
| FastKronQuant-Qwen3-8B-2bit | Qwen-3-8B | 2-bit |
Each checkpoint is fully compatible with Hugging Face `transformers` and can be loaded like any standard model.
Features
- ⚡ Fast Kronecker decomposition: up to 10× faster factor estimation
- 🧮 Second-order quantization: preserves model accuracy
- 🪶 Works with popular architectures: LLaMA-2, LLaMA-3, Qwen-3
- Compatible with 🤗 `transformers` out of the box
Usage Example (LLaMA-2 7B)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # keep the precision stored in the checkpoint
    device_map="auto",    # place weights on available devices automatically
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
🧪 Example: ARC Easy evaluation

```python
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

ds = load_dataset("ai2_arc", "ARC-Easy")["test"]

repo_id = "username/FastKronQuant-LLaMA2-7B"
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

pipe = pipeline("text-generation", model=model, tokenizer=tok)

# Print free-form answers for the first three test questions.
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
```
If you use these model checkpoints in your experiments, please cite:
```bibtex
@misc{chekalina2025gfwsvd,
  title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
  author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
  year={2025},
  eprint={2505.17974},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.17974},
}
```
Zero-shot results: LLaMA-3 8B

4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5171 | 0.8409 | 0.7986 | 0.8177 | 0.5908 | 0.7131 | – | – |
| 4-bit Sketch A | 4096 | 0.5136 | 0.8443 | 0.7997 | 0.8198 | 0.5865 | 0.7127 | 92 | 16 M |
| 4-bit FastKron | 75 | 0.5116 | 0.8438 | 0.8025 | 0.8207 | 0.5863 | 0.7129 | 9.5 | 712 K |
| 4-bit No Hess | – | 0.5119 | 0.8415 | 0.7959 | 0.8097 | 0.5859 | 0.7112 | – | – |
2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4312 | 0.7567 | 0.7647 | 0.7391 | 0.5259 | 0.6435 | 92 | 16 M |
| 2-bit FastKron | 100 | 0.4277 | 0.7646 | 0.7661 | 0.7468 | 0.5159 | 0.6442 | 11.5 | 950 K |
| 2-bit No Hess | – | 0.2363 | 0.6336 | 0.6554 | 0.5108 | 0.3620 | 0.5094 | – | – |
Zero-shot results: Qwen-3 8B

4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5563 | 0.8682 | 0.7677 | 0.8354 | 0.5708 | 0.7197 | – | – |
| 4-bit Sketch A | 4096 | 0.5503 | 0.8611 | 0.7612 | 0.8324 | 0.5601 | 0.7132 | 84 | 8 M |
| 4-bit FastKron | 150 | 0.5469 | 0.8667 | 0.7601 | 0.8287 | 0.5637 | 0.7132 | 42 | 712 K |
| 4-bit No Hess | – | 0.5467 | 0.8675 | 0.7622 | 0.8312 | 0.5585 | 0.7132 | – | – |
2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4536 | 0.7782 | 0.7435 | 0.7797 | 0.4611 | 0.6432 | 84 | 8 M |
| 2-bit FastKron | 150 | 0.4616 | 0.8416 | 0.7334 | 0.7702 | 0.4853 | 0.6584 | 42 | 712 K |
| 2-bit No Hess | – | 0.3993 | 0.8675 | 0.7743 | 0.7003 | 0.4758 | 0.6434 | – | – |
Zero-shot results: LLaMA-2 7B

4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.4325 | 0.7767 | 0.7774 | 0.7617 | 0.5721 | 0.6640 | – | – |
| 4-bit Sketch A | 4096 | 0.4274 | 0.7688 | 0.7752 | 0.7613 | 0.5672 | 0.6599 | 50 | 16 M |
| 4-bit FastKron | 75 | 0.4283 | 0.7792 | 0.7802 | 0.7610 | 0.5660 | 0.6629 | 5 | 712 K |
| 4-bit No Hess | – | 0.4352 | 0.7875 | 0.7742 | 0.7609 | 0.5628 | 0.6641 | – | – |
2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.3805 | 0.7333 | 0.7562 | 0.7192 | 0.5227 | 0.6223 | 50 | 16 M |
| 2-bit FastKron | 150 | 0.3843 | 0.7510 | 0.7600 | 0.7112 | 0.5139 | 0.6240 | 6 | 1400 K |
| 2-bit No Hess | – | 0.2210 | 0.6355 | 0.6306 | 0.5152 | 0.3422 | 0.4689 | – | – |