arXiv:2507.04610

any4: Learned 4-bit Numeric Representation for LLMs

Published on Jul 7 · Submitted by melhoushi on Jul 9
Abstract

We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring preprocessing of weights or activations. any4 yields higher accuracy than other 4-bit numeric representation types: int4, fp4, and nf4, as evaluated on a range of model sizes, generations, and families (Llama 2, Llama 3, Mistral, and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that do require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bit widths. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset, as done in most quantization approaches. We also open-source tinygemm, a latency-optimized GPU matrix multiplication library for LLMs that implements any4 using a GPU-efficient lookup table strategy, along with other common quantization methods. Our code is available at https://github.com/facebookresearch/any4.
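
As a rough sketch of the lookup-table idea described in the abstract (the names, shapes, and function below are our own illustration and an assumption, not the tinygemm API or the paper's implementation), dequantizing a per-row 4-bit LUT format in PyTorch can look like this:

```python
import torch

def dequantize_lut4(codes: torch.Tensor, luts: torch.Tensor) -> torch.Tensor:
    """Reconstruct weights from 4-bit codes and per-row lookup tables.

    codes: (rows, cols) integer tensor with values in [0, 16).
    luts:  (rows, 16) float tensor, one learned 16-entry codebook per row.
    """
    # out[i, j] = luts[i, codes[i, j]]: each 4-bit code indexes its row's LUT.
    return torch.gather(luts, 1, codes.long())

codes = torch.randint(0, 16, (2, 8))    # toy 2x8 quantized weight matrix
luts = torch.randn(2, 16)               # toy per-row codebooks
weights = dequantize_lut4(codes, luts)  # (2, 8) dequantized weights
```

Because the table has only 16 entries per row, a GPU kernel can keep it in registers or shared memory and fuse the lookup into the matrix multiply, which is the efficiency argument the abstract makes.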

Community

Paper author and submitter:

We introduce any4, a new 4-bit weight quantization method for large language models that replaces fixed codebooks like int4, fp4, or nf4 with a learned lookup table (LUT) for each row of the weight matrix. This per-row flexibility allows any4 to map 4-bit codes to arbitrary floating-point values, significantly improving quantization accuracy across models such as Llama 2/3, Mistral, and Mixtral. any4 is calibration-efficient, requiring minimal data and no outlier handling, while still matching or outperforming more complex approaches like GPTQ and AWQ.
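
One simple way to obtain such a per-row table is to cluster each row's weights into 16 representative values. The k-means sketch below is only an illustrative baseline under that assumption; it is not the paper's calibration procedure, which learns the LUT from data:

```python
import torch

def learn_row_lut(row: torch.Tensor, iters: int = 20):
    """Fit a 16-entry codebook to one weight row with plain k-means.

    Illustrative baseline only: any4's learned representation is calibrated
    on data and may optimize a different objective.
    """
    # Start centroids at evenly spaced quantiles of the row's values.
    centroids = torch.quantile(row, torch.linspace(0, 1, 16))
    codes = torch.zeros_like(row, dtype=torch.long)
    for _ in range(iters):
        # Assignment step: the nearest centroid's index becomes the 4-bit code.
        codes = (row.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
        # Update step: move each centroid to the mean of its assigned weights.
        for k in range(16):
            members = row[codes == k]
            if members.numel() > 0:
                centroids[k] = members.mean()
    return codes, centroids

row = torch.randn(4096)          # one row of a weight matrix
codes, lut = learn_row_lut(row)  # 4-bit codes plus the row's 16-entry LUT
```

The point of the per-row table is that each row gets centroids matched to its own value distribution, rather than a single fixed grid (as in int4 or nf4) shared by the whole tensor.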

We also release tinygemm, a GPU-optimized library for low-latency inference that supports not only any4 but also int4, int8, and nf4:
https://github.com/facebookresearch/any4
