---
tags:
- quantization
- kronecker
- second-order
- YAQA
- LLaMA
- Qwen
- efficient
---
⚡ FastKronQuantization
Fast second-order Kronecker-factored quantization for LLMs
🧠 Abstract
Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent YAQA framework (Tseng et al., 2025b), which employs Kronecker-factored approximations of the Hessian via a power-iteration technique, we propose an alternative approach that replaces this step with a more efficient Kronecker decomposition method from Chekalina et al. (2025).
This formulation preserves the benefits of second-order, curvature-aware quantization while substantially reducing computational cost.
We apply our method to LLaMA-2 7B, LLaMA-3 8B Instruct, and Qwen-3 8B Instruct and show that it matches YAQA's post-quantization model quality with significantly less computation: the Kronecker factors required for the target quality are obtained from roughly 10× fewer tokens and with an approximately 10× speedup over the original work.
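Both the sketched estimator in YAQA and the decomposition of Chekalina et al. (2025) rest on the same underlying idea: approximate a layer's curvature matrix by a Kronecker product of two much smaller factors. The snippet below is a minimal, generic illustration of that idea (the classic Van Loan–Pitsianis nearest-Kronecker-product construction), not the estimator used in either paper or in this repository; the sizes, the random test matrix, and the function name are stand-ins.

```python
# Generic illustration (not the estimator from YAQA or Chekalina et al., 2025):
# the best Kronecker approximation H ≈ A ⊗ B in the Frobenius norm reduces to a
# rank-1 approximation of a rearranged matrix (Van Loan–Pitsianis). Sizes and
# the random test matrix are arbitrary stand-ins.
import numpy as np

def nearest_kronecker(H: np.ndarray, p: int, q: int):
    """Return A (p x p) and B (q x q) minimizing ||H - kron(A, B)||_F."""
    assert H.shape == (p * q, p * q)
    # Group H into p x p blocks of size q x q, then flatten each block into a row.
    blocks = H.reshape(p, q, p, q)                  # blocks[i, k, j, l] = H[i*q + k, j*q + l]
    R = blocks.transpose(0, 2, 1, 3).reshape(p * p, q * q)
    # The best rank-1 factor of R gives vec(A) and vec(B).
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(S[0]) * U[:, 0].reshape(p, p)
    B = np.sqrt(S[0]) * Vt[0].reshape(q, q)
    return A, B

p, q = 8, 16
rng = np.random.default_rng(0)
M = rng.standard_normal((p * q, p * q))
H = M @ M.T                                         # symmetric PSD stand-in for a layer Hessian
A, B = nearest_kronecker(H, p, q)
err = np.linalg.norm(H - np.kron(A, B)) / np.linalg.norm(H)
print(f"relative Frobenius error of the Kronecker fit: {err:.3f}")
```

Keeping only the small factors A and B, rather than the full (p·q) × (p·q) matrix, is what makes Kronecker-factored curvature tractable at LLM scale; the two approaches differ in how these factors are estimated from calibration data, which is the step this repository accelerates.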
🧭 Checkpoints

| Model name | Architecture | Bits |
|---|---|---|
| FastKronQuant-LLaMA2-7B-4bit | LLaMA-2-7B | 4-bit |
| FastKronQuant-LLaMA3-8B-4bit | LLaMA-3-8B-Instruct | 4-bit |
| FastKronQuant-Qwen3-8B-4bit | Qwen-3-8B | 4-bit |
| FastKronQuant-LLaMA2-7B-2bit | LLaMA-2-7B | 2-bit |
| FastKronQuant-LLaMA3-8B-2bit | LLaMA-3-8B-Instruct | 2-bit |
| FastKronQuant-Qwen3-8B-2bit | Qwen-3-8B | 2-bit |
Each checkpoint is fully compatible with Hugging Face `transformers` and can be loaded like any standard model.
Features
- ⚡ Fast Kronecker decomposition: up to 10× faster factor estimation
- 🧮 Second-order quantization: preserves model accuracy
- 🪶 Works with popular architectures: LLaMA-2, LLaMA-3, Qwen-3
- Compatible with 🤗 `transformers` out of the box
Usage Example (LLaMA-2 7B)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # keep the precision stored in the checkpoint
    device_map="auto",    # place weights on available devices automatically
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
🧪 Example: ARC Easy evaluation

```python
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

ds = load_dataset("ai2_arc", "ARC-Easy")["test"]

repo_id = "username/FastKronQuant-LLaMA2-7B"
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

pipe = pipeline("text-generation", model=model, tokenizer=tok)

# Print free-form answers for the first three test questions.
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
```
If you use these model checkpoints in your experiments, please cite:
```bibtex
@misc{chekalina2025gfwsvd,
  title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
  author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
  year={2025},
  eprint={2505.17974},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.17974},
}
```
Zero-shot results: LLaMA-3 8B

4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5171 | 0.8409 | 0.7986 | 0.8177 | 0.5908 | 0.7131 | – | – |
| 4-bit Sketch A | 4096 | 0.5136 | 0.8443 | 0.7997 | 0.8198 | 0.5865 | 0.7127 | 92 | 16 M |
| 4-bit FastKron | 75 | 0.5116 | 0.8438 | 0.8025 | 0.8207 | 0.5863 | 0.7129 | 9.5 | 712 K |
| 4-bit No Hess | – | 0.5119 | 0.8415 | 0.7959 | 0.8097 | 0.5859 | 0.7112 | – | – |
2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4312 | 0.7567 | 0.7647 | 0.7391 | 0.5259 | 0.6435 | 92 | 16 M |
| 2-bit FastKron | 100 | 0.4277 | 0.7646 | 0.7661 | 0.7468 | 0.5159 | 0.6442 | 11.5 | 950 K |
| 2-bit No Hess | – | 0.2363 | 0.6336 | 0.6554 | 0.5108 | 0.3620 | 0.5094 | – | – |
Zero-shot results: Qwen-3 8B

4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5563 | 0.8682 | 0.7677 | 0.8354 | 0.5708 | 0.7197 | – | – |
| 4-bit Sketch A | 4096 | 0.5503 | 0.8611 | 0.7612 | 0.8324 | 0.5601 | 0.7132 | 84 | 8 M |
| 4-bit FastKron | 150 | 0.5469 | 0.8667 | 0.7601 | 0.8287 | 0.5637 | 0.7132 | 42 | 712 K |
| 4-bit No Hess | – | 0.5467 | 0.8675 | 0.7622 | 0.8312 | 0.5585 | 0.7132 | – | – |
2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4536 | 0.7782 | 0.7435 | 0.7797 | 0.4611 | 0.6432 | 84 | 8 M |
| 2-bit FastKron | 150 | 0.4616 | 0.8416 | 0.7334 | 0.7702 | 0.4853 | 0.6584 | 42 | 712 K |
| 2-bit No Hess | – | 0.3993 | 0.8675 | 0.7743 | 0.7003 | 0.4758 | 0.6434 | – | – |
Zero-shot results: LLaMA-2 7B

4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.4325 | 0.7767 | 0.7774 | 0.7617 | 0.5721 | 0.6640 | – | – |
| 4-bit Sketch A | 4096 | 0.4274 | 0.7688 | 0.7752 | 0.7613 | 0.5672 | 0.6599 | 50 | 16 M |
| 4-bit FastKron | 75 | 0.4283 | 0.7792 | 0.7802 | 0.7610 | 0.5660 | 0.6629 | 5 | 712 K |
| 4-bit No Hess | – | 0.4352 | 0.7875 | 0.7742 | 0.7609 | 0.5628 | 0.6641 | – | – |
2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.3805 | 0.7333 | 0.7562 | 0.7192 | 0.5227 | 0.6223 | 50 | 16 M |
| 2-bit FastKron | 150 | 0.3843 | 0.7510 | 0.7600 | 0.7112 | 0.5139 | 0.6240 | 6 | 1400 K |
| 2-bit No Hess | – | 0.2210 | 0.6355 | 0.6306 | 0.5152 | 0.3422 | 0.4689 | – | – |