
---
tags:
  - quantization
  - kronecker
  - second-order
  - YAQA
  - LLaMA
  - Qwen
  - efficient
---

# ⚡ FastKronQuantization

*Fast second-order Kronecker-factored quantization for LLMs*


## 🧠 Abstract

Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent YAQA framework (Tseng et al., 2025b), which estimates Kronecker-factored approximations of the Hessian via a power-iteration technique, we propose an alternative that replaces this step with the more efficient Kronecker decomposition method of Chekalina et al. (2025).
This formulation preserves the benefits of second-order, curvature-aware quantization while substantially reducing computational cost.
We apply our method to LLaMA-2 7B, LLaMA-3 8B Instruct, and Qwen-3 8B Instruct and show that it matches YAQA's post-quantization model quality at a fraction of the cost: the Kronecker factors needed to reach target quality are obtained with 10× fewer tokens and roughly a 10× speedup over the original work.
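For intuition: under a Kronecker factorization H ≈ A ⊗ B, the layer-wise curvature penalty vec(ΔW)ᵀ H vec(ΔW) collapses (up to vectorization conventions) to a small trace expression over the two factors, so the factors are all the quantizer needs. The snippet below is a minimal sketch, not the authors' implementation, of the classical nearest-Kronecker-product construction (Van Loan and Pitsianis): the factors minimizing ‖H − A ⊗ B‖_F are read off from a rank-1 SVD of a rearrangement of H. The Fisher-weighted method of Chekalina et al. (2025) is a more scalable refinement of this idea.

```python
# Minimal sketch (illustration only, not the authors' code) of the classical
# nearest-Kronecker-product idea: the best A (m x m) and B (n x n) minimizing
# ||H - kron(A, B)||_F come from the leading singular pair of a rearrangement of H.
import torch

def nearest_kron_factors(H: torch.Tensor, m: int, n: int):
    assert H.shape == (m * n, m * n)
    # Rearrange H so each n x n block becomes one row: R[i*m + j] = vec(block_ij).
    R = (
        H.reshape(m, n, m, n)    # split row/col indices into (block, inner)
         .permute(0, 2, 1, 3)    # bring the two block indices together
         .reshape(m * m, n * n)  # one flattened n x n block per row
    )
    # The best rank-1 fit R ~= s * u v^T yields the Kronecker factors.
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    A = (S[0].sqrt() * U[:, 0]).reshape(m, m)
    B = (S[0].sqrt() * Vh[0, :]).reshape(n, n)
    return A, B

# Sanity check on a matrix that is exactly a Kronecker product.
A0, B0 = torch.randn(4, 4), torch.randn(8, 8)
H = torch.kron(A0, B0)
A, B = nearest_kron_factors(H, 4, 8)
print(torch.norm(H - torch.kron(A, B)))  # ~0: reconstruction is exact here
```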


## 🧭 Checkpoints

| Model name | Architecture | Bits |
|---|---|---|
| FastKronQuant-LLaMA2-7B-4bit | LLaMA-2-7B | 4-bit |
| FastKronQuant-LLaMA3-8B-4bit | LLaMA-3-8B-Instruct | 4-bit |
| FastKronQuant-Qwen3-8B-4bit | Qwen-3-8B | 4-bit |
| FastKronQuant-LLaMA2-7B-2bit | LLaMA-2-7B | 2-bit |
| FastKronQuant-LLaMA3-8B-2bit | LLaMA-3-8B-Instruct | 2-bit |
| FastKronQuant-Qwen3-8B-2bit | Qwen-3-8B | 2-bit |

Each checkpoint is fully compatible with Hugging Face `transformers` and can be loaded like any standard model.


## 📌 Features

- ⚡ **Fast Kronecker decomposition**: up to 10× faster factor estimation
- 🧮 **Second-order quantization**: preserves model accuracy
- 🪶 Works with popular architectures: LLaMA-2, LLaMA-3, Qwen-3
- 🔸 Compatible with 🤗 `transformers` out of the box

## 🚀 Usage Example (LLaMA-2 7B)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "username/FastKronQuant-LLaMA2-7B"  # replace with actual repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 🧪 Example: ARC-Easy evaluation

```python
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Load ARC-Easy
ds = load_dataset("ai2_arc", "ARC-Easy")["test"]

# Load quantized model
repo_id = "username/FastKronQuant-LLaMA2-7B"
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)
pipe = pipeline("text-generation", model=model, tokenizer=tok)

# Simple evaluation loop
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
```
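The generation loop above is only a qualitative smoke test. Zero-shot accuracies like those in the tables below are usually computed by likelihood-based multiple-choice scoring (the lm-evaluation-harness convention): each answer choice is appended to the question, and the choice whose tokens receive the highest total log-probability wins. Below is a hedged sketch of that scoring, reusing `model`, `tok`, and `ds` from above; field names follow the `ai2_arc` schema, and this is not necessarily the exact harness behind the reported numbers.

```python
import torch

def choice_logprob(question: str, choice: str) -> float:
    """Total log-probability the model assigns to the answer tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    # logits[:, t] predicts token t+1, so shift by one position.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Assumes the prompt tokenization is a prefix of the full tokenization
    # (true for most BPE tokenizers when the answer starts with a space).
    answer_ids = ids[0, prompt_len:]
    return logprobs[prompt_len - 1:].gather(1, answer_ids.unsqueeze(1)).sum().item()

correct = 0
subset = ds.select(range(50))  # small subset for a quick check
for ex in subset:
    scores = [choice_logprob(ex["question"], c) for c in ex["choices"]["text"]]
    pred = ex["choices"]["label"][max(range(len(scores)), key=scores.__getitem__)]
    correct += pred == ex["answerKey"]
print(f"accuracy on {len(subset)} examples: {correct / len(subset):.3f}")
```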

## 📝 Citation

If you use these model checkpoints in your experiments, please cite:

```bibtex
@misc{chekalina2025gfwsvd,
  title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
  author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
  year={2025},
  eprint={2505.17974},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.17974},
}
```

## 📊 Zero-shot results: LLaMA-3 8B

### 🟡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5171 | 0.8409 | 0.7986 | 0.8177 | 0.5908 | 0.7131 | – | – |
| 4-bit Sketch A | 4096 | 0.5136 | 0.8443 | 0.7997 | 0.8198 | 0.5865 | 0.7127 | 92 | 16 M |
| 4-bit FastKron | 75 | 0.5116 | 0.8438 | 0.8025 | 0.8207 | 0.5863 | 0.7129 | 9.5 | 712 K |
| 4-bit No Hess | – | 0.5119 | 0.8415 | 0.7959 | 0.8097 | 0.5859 | 0.7112 | – | – |

### 🟠 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4312 | 0.7567 | 0.7647 | 0.7391 | 0.5259 | 0.6435 | 92 | 16 M |
| 2-bit FastKron | 100 | 0.4277 | 0.7646 | 0.7661 | 0.7468 | 0.5159 | 0.6442 | 11.5 | 950 K |
| 2-bit No Hess | – | 0.2363 | 0.6336 | 0.6554 | 0.5108 | 0.3620 | 0.5094 | – | – |

## 📊 Zero-shot results: Qwen-3 8B

### 🟡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.5563 | 0.8682 | 0.7677 | 0.8354 | 0.5708 | 0.7197 | – | – |
| 4-bit Sketch A | 4096 | 0.5503 | 0.8611 | 0.7612 | 0.8324 | 0.5601 | 0.7132 | 84 | 8 M |
| 4-bit FastKron | 150 | 0.5469 | 0.8667 | 0.7601 | 0.8287 | 0.5637 | 0.7132 | 42 | 712 K |
| 4-bit No Hess | – | 0.5467 | 0.8675 | 0.7622 | 0.8312 | 0.5585 | 0.7132 | – | – |

### 🟠 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.4536 | 0.7782 | 0.7435 | 0.7797 | 0.4611 | 0.6432 | 84 | 8 M |
| 2-bit FastKron | 150 | 0.4616 | 0.8416 | 0.7334 | 0.7702 | 0.4853 | 0.6584 | 42 | 712 K |
| 2-bit No Hess | – | 0.3993 | 0.8675 | 0.7743 | 0.7003 | 0.4758 | 0.6434 | – | – |

## 📊 Zero-shot results: LLaMA-2 7B

### 🟡 4-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit (baseline) | – | 0.4325 | 0.7767 | 0.7774 | 0.7617 | 0.5721 | 0.6640 | – | – |
| 4-bit Sketch A | 4096 | 0.4274 | 0.7688 | 0.7752 | 0.7613 | 0.5672 | 0.6599 | 50 | 16 M |
| 4-bit FastKron | 75 | 0.4283 | 0.7792 | 0.7802 | 0.7610 | 0.5660 | 0.6629 | 5 | 712 K |
| 4-bit No Hess | – | 0.4352 | 0.7875 | 0.7742 | 0.7609 | 0.5628 | 0.6641 | – | – |

### 🟠 2-bit Quantization

| Method | Steps | ARC_c ↑ | BoolQ ↑ | PIQA ↑ | ARC_e ↑ | HSwag ↑ | AVG ↑ | GPU-hours ↓ | Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 2-bit Sketch A | 4096 | 0.3805 | 0.7333 | 0.7562 | 0.7192 | 0.5227 | 0.6223 | 50 | 16 M |
| 2-bit FastKron | 150 | 0.3843 | 0.7510 | 0.7600 | 0.7112 | 0.5139 | 0.6240 | 6 | 1400 K |
| 2-bit No Hess | – | 0.2210 | 0.6355 | 0.6306 | 0.5152 | 0.3422 | 0.4689 | – | – |