🌳 QAT: The Art of Growing a Bonsai Model

Community Article Published November 9, 2025

The Achievement

Kimi K2 Thinking demonstrates something remarkable: a 1 trillion parameter reasoning model that runs at INT4 precision and still achieves SOTA results. It delivers a 2× generation speed improvement with "lossless" accuracy, all while each weight is represented by just 16 possible values.

At first glance, this seems impossible. How can you represent the full expressiveness of a neural network weight using only 4 bits? The answer lies in Quantization-Aware Training (QAT), and understanding why reveals fundamental insights about how neural networks really work.

Why 16 Values Per Weight Are Enough

The Naive View (Wrong)

Most people think about quantization like this:

Original FP16 weight: 0.00347182...
↓ (naive rounding)
INT4 value: ???
Problem: How do we fit infinite precision into 16 buckets?

This seems like a lossy compression nightmare. And with naive quantization, it is.

The Reality: It's About Distribution, Not Precision

Here's what actually matters for a neural network weight:

What the weight needs to represent:

  • Not: "exactly 0.00347182"
  • But: "this connection should be slightly positive and weak"

Neural networks don't care about absolute precision; they care about relative importance and patterns across many weights.

Visualizing Weight Distribution

Here's what the weight distribution looks like in a typical neural network layer:

Frequency
    │
    │     ╱╲
    │    ╱  ╲           Most weights cluster
    │   ╱    ╲          around zero
    │  ╱      ╲
    │ ╱        ╲
    │╱          ╲___________
    └──────────────────────── Weight value
   -0.5    0    0.5

Key insight: Most weights are small, with a few large outliers. We don't need uniform precision; we need more precision where weights cluster.
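
To see this concretely, here is a quick NumPy check on a synthetic, Gaussian-initialized layer. This is purely illustrative (real trained weights are not exactly Gaussian), but it shows the same "most of the mass near zero" shape as the histogram above:

import numpy as np

# Synthetic "layer weights": a rough stand-in for the bell-shaped histogram above.
rng = np.random.default_rng(0)
w = rng.normal(loc=0.0, scale=0.15, size=1_000_000)

inner = np.mean(np.abs(w) < 0.2)   # fraction of weights in the narrow central band
outer = np.mean(np.abs(w) > 0.4)   # fraction of large outliers
print(f"{inner:.1%} of weights lie in [-0.2, 0.2]; only {outer:.1%} fall beyond ±0.4")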

INT4 Quantization: Strategic Value Placement

Here's how INT4 quantization actually works:

Step 1: Find the scale and zero-point
──────────────────────────────────
FP16 range: [-0.47, 0.53]
Map to INT4 range: [0, 15]

Scale = (0.53 - (-0.47)) / 15 = 0.0667
Zero-point = 7


Step 2: Define the 16 quantization levels
─────────────────────────────────────────
INT4     Actual FP16
Value    Value
────────────────────
 0   →   -0.467
 1   →   -0.400
 2   →   -0.333
 3   →   -0.267
 4   →   -0.200
 5   →   -0.133
 6   →   -0.067
 7   →    0.000  ← zero-point (most common)
 8   →    0.067
 9   →    0.133
10   →    0.200
11   →    0.267
12   →    0.333
13   →    0.400
14   →    0.467
15   →    0.533

Each weight gets mapped to its nearest quantization level. The scale and zero-point are chosen per-channel or per-group to maximize representation quality.
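
As a concrete sketch, here is the asymmetric scheme above in a few lines of NumPy. The function names are mine, not from any particular quantization library:

import numpy as np

def int4_quantize(w, n_levels=16):
    # Per-group asymmetric quantization: pick scale/zero-point from the group's range.
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / (n_levels - 1)          # e.g. (0.53 - (-0.47)) / 15 = 0.0667
    zero_point = int(round(-w_min / scale))           # the INT4 code that maps back to 0.0
    q = np.clip(np.round(w / scale) + zero_point, 0, n_levels - 1).astype(np.int8)
    return q, scale, zero_point

def int4_dequantize(q, scale, zero_point):
    # Reconstruct approximate FP values from the 16 codes.
    return (q.astype(np.float32) - zero_point) * scale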

Why This Works

Example weight matrix (4×4, showing FP16 values):

Before Quantization (FP16):
┌─────────────────────────────────┐
│  0.023  -0.156   0.401  -0.089  │
│ -0.312   0.067   0.134  -0.445  │
│  0.189  -0.223   0.012   0.356  │
│ -0.078   0.445  -0.267   0.101  │
└─────────────────────────────────┘

After INT4 Quantization (storing INT4 values):
┌─────────────────────────────────┐
│    7      5     13      6       │  ← Each value is 0-15
│    2      8      9      0       │
│   10      4      7     12       │
│    6     14      3      9       │
└─────────────────────────────────┘

Reconstructed at inference (FP16):
┌─────────────────────────────────┐
│  0.000  -0.133   0.400  -0.067  │  ← Close to original!
│ -0.333   0.067   0.133  -0.467  │
│  0.200  -0.200   0.000   0.333  │
│ -0.067   0.467  -0.267   0.133  │
└─────────────────────────────────┘

The magic: For most neural network operations (matrix multiplications, activations), these approximations preserve the essential patterns. The network learns to be robust to this discretization.
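
Running the 4×4 example through the same arithmetic (reusing the fixed scale 0.0667 and zero-point 7 from Step 1, so the codes match the table above) confirms that every reconstruction error stays within half a quantization step:

import numpy as np

W = np.array([[ 0.023, -0.156,  0.401, -0.089],
              [-0.312,  0.067,  0.134, -0.445],
              [ 0.189, -0.223,  0.012,  0.356],
              [-0.078,  0.445, -0.267,  0.101]], dtype=np.float32)

scale, zero_point = 1.0 / 15, 7                       # the Step 1 values, reused here
q = np.clip(np.round(W / scale) + zero_point, 0, 15).astype(np.int8)
W_hat = (q.astype(np.float32) - zero_point) * scale   # what inference actually sees

print(q)                                              # the INT4 codes, all in 0..15
print(np.abs(W - W_hat).max())                        # worst-case error <= scale/2, about 0.033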

QAT vs PTQ

Post-Training Quantization (PTQ): The Naive Approach

PTQ is like trying to fit a fully-grown tree into a small pot:

PTQ (Post-Training Quantization):
═══════════════════════════════════════════════════

Step 1: Train the model normally
────────────────────────────────
[Full precision training]
│
│  Weights learn: "I need exactly 0.00347182 to be accurate"
│
▼
Trained FP16 Model


Step 2: Quantize after training
────────────────────────────────
[Sudden precision loss]
│
│  Weight 0.00347182 → INT4 value 7 → 0.00000000
│  Weight 0.15234 → INT4 value 9 → 0.13333333
│  Weight -0.08234 → INT4 value 6 → -0.06666666
│
▼
INT4 Model (accuracy drops!)


The Problem:
────────────
Network was never trained to be robust to quantization
└─> Accumulated errors over long sequences
    └─> For reasoning models: catastrophic failure
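
To make this concrete, here is a toy PTQ experiment (a random stand-in matrix, not the real model): quantize a layer after the fact and measure how much its output moves relative to full precision. The network never gets a chance to compensate for this shift, which is why the error then stacks across layers and generated tokens:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.15, size=(256, 256)).astype(np.float32)   # stand-in for trained weights
x = rng.normal(0.0, 1.0, size=(1, 256)).astype(np.float32)      # one activation vector

# Quantize after training: nothing about W was ever adapted to these 16 levels.
# (Crude per-tensor scheme for brevity; real pipelines use per-group scales.)
scale = float(W.max() - W.min()) / 15
zp = int(round(-float(W.min()) / scale))
W_hat = (np.clip(np.round(W / scale) + zp, 0, 15) - zp) * scale

y_fp, y_q = x @ W.T, x @ W_hat.T
rel_err = float(np.abs(y_fp - y_q).mean() / np.abs(y_fp).mean())
print(f"mean relative output error of this one layer after PTQ: {rel_err:.2%}")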

Quantization-Aware Training (QAT): Growing in the Pot

QAT is like growing a bonsai tree, shaped to fit from the start:

QAT (Quantization-Aware Training):
═══════════════════════════════════════════════════

Step 1: Start with pre-trained model
─────────────────────────────────────
[FP16 model from pre-training]


Step 2: Add quantization simulation during training
────────────────────────────────────────────────────
Forward Pass:
    │
    ├─> FP16 weight: 0.00347182
    │
    ├─> [Simulate quantization]
    │   └─> Quantize:   0.00347182 → INT4(7) → 0.00000000
    │   └─> Use quantized value in forward pass
    │
    └─> Compute loss with quantized weights

Backward Pass:
    │
    ├─> Gradient flows through
    │   (using straight-through estimator)
    │
    └─> Update FP16 weight: 0.00347182 → 0.00123456
        (But future forward passes will quantize it)


Step 3: Model learns to be robust
──────────────────────────────────
After many iterations:
    │
    ├─> Weights naturally cluster around quantization levels
    ├─> Network compensates for quantization errors
    ├─> Critical weights move to "stable" quantization points
    │
    └─> Final model performs well even when fully quantized


Step 4: Save as INT4
─────────────────────
[Convert final FP16 weights → INT4]
│
└─> No accuracy loss! Model was trained for this.
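
Here is a minimal PyTorch sketch of the fake-quantization trick described in Step 2, using a straight-through estimator. This is a generic QAT recipe, not Kimi's actual training code; the class and function names are mine:

import torch

def fake_quant_int4(w, n_levels=16):
    # Simulate INT4 in the forward pass (per-tensor here; real setups use per-group scales).
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (n_levels - 1)
    zp = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zp, 0, n_levels - 1)
    w_hat = (q - zp) * scale
    # Straight-through estimator: the forward pass sees the quantized value,
    # the backward pass treats rounding as identity, so gradients reach the FP weights.
    return w + (w_hat - w).detach()

class QATLinear(torch.nn.Linear):
    # A Linear layer that keeps FP master weights but always *behaves* like an INT4 layer.
    def forward(self, x):
        return torch.nn.functional.linear(x, fake_quant_int4(self.weight), self.bias)

During training the optimizer still updates the full-precision master weights; at export time (Step 4) you keep only the INT4 codes plus their scales and zero-points, which is exactly the representation the network has been living with all along.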

The Key Difference

PTQ: "How can I squeeze this model into INT4?"
     └─> Forcing a round peg into a square hole

QAT: "How can I train a model that naturally works well in INT4?"
     └─> Growing a square peg from the start

Why QAT Matters

The Reasoning Model Challenge

Reasoning models have a unique problem:

Prompt: "Solve this math problem..."
   │
   ├─> Reasoning token 1    (small error: +0.1%)
   ├─> Reasoning token 2    (error compounds: +0.2%)
   ├─> Reasoning token 3    (error compounds: +0.3%)
   ├─> ...
   ├─> Reasoning token 10,000  (error explodes: +25%!)
   │
   └─> Final answer: WRONG

With PTQ, small quantization errors compound over thousands of reasoning tokens. The model "drifts" off course.
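
A back-of-the-envelope way to see the compounding (the percentages in the diagram above and the per-step error here are illustrative assumptions, not measured numbers): if each autoregressive step can amplify the accumulated relative error by a factor of (1 + eps), the worst case after N steps grows like (1 + eps)**N:

# Hypothetical worst-case compounding; eps is an assumed per-step error, not a measurement.
eps = 1e-4                                        # 0.01% error per generated token (assumed)
for n_tokens in (10, 100, 1_000, 10_000):
    drift = (1 + eps) ** n_tokens - 1
    print(f"{n_tokens:>6} tokens -> worst-case drift {drift:.1%}")

QAT attacks eps itself: the weights settle onto values that survive quantization, so each step contributes less error for the chain to compound.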

Kimi's Solution: QAT on MoE Components

Architecture:
═══════════════════════════════════════════

┌──────────────────────────────────────────┐
│         Attention Layers                 │
│         (FP16 or FP8)                    │
└──────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│     MoE Layer (INT4 with QAT)            │
│  ┌──────┐ ┌──────┐      ┌──────┐         │
│  │Expert│ │Expert│ ...  │Expert│         │  ← 384 experts
│  │  1   │ │  2   │      │ 384  │         │  ← Each in INT4
│  └──────┘ └──────┘      └──────┘         │
│     ↑         ↑             ↑            │
│     └─────────┴─────────────┘            │
│         Router (selects 8)               │
└──────────────────────────────────────────┘

Why apply QAT specifically to MoE components?

  1. Most parameters are here: 384 experts × parameters per expert
  2. Redundancy: Only 8 of 384 experts are active per token, leaving more tolerance for approximation (see the routing sketch below)
  3. Bottleneck: MoE layers are memory-bandwidth limited
  4. Biggest win: 4× memory reduction where it matters most
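
For intuition, here is a toy top-k routing sketch in PyTorch. Sizes and names are hypothetical stand-ins; K2's real MoE uses 384 experts with top-8 routing, and each expert's weights would be stored as INT4 codes plus per-group scales rather than the plain FP32 Linear layers used here:

import torch

d_model, n_experts, top_k = 64, 16, 2          # toy sizes; K2: 384 experts, 8 active

router = torch.nn.Linear(d_model, n_experts, bias=False)
experts = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Linear(d_model, 4 * d_model),
                        torch.nn.GELU(),
                        torch.nn.Linear(4 * d_model, d_model))
    for _ in range(n_experts)
])

def moe_forward(x):                            # x: (tokens, d_model)
    probs = torch.softmax(router(x), dim=-1)
    gates, idx = torch.topk(probs, k=top_k, dim=-1)
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):                # naive per-token loop, for clarity only
        for gate, e in zip(gates[t], idx[t]):
            # Only the selected experts' weights are touched for this token,
            # which is why quantizing the experts gives the biggest memory win.
            out[t] += gate * experts[int(e)](x[t])
    return out

y = moe_forward(torch.randn(4, d_model))
print(y.shape)                                 # torch.Size([4, 64])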

Conclusion: The 4-Bit Wisdom

The remarkable thing about Kimi K2 Thinking isn't just that it uses 4 bits per weight. It's that through QAT, the model learns to live in a 16-value-per-weight world.

The two key insights:

  1. 16 values are enough because neural networks have massive redundancy, and with the right scale/zero-point per group, those 16 values can be strategically placed where the weight distribution actually lives.

  2. QAT vs PTQ isn't just about when you quantize; it's about teaching the model to be robust to quantization from the start, preventing error accumulation in long reasoning chains.



Community

Any evaluation of reduced MoE quality for activations with this model?

Article author

Not from the Kimi report, but it is discussed in the MoQE paper (they are on the champion side):
https://huggingface.co/papers/2310.02410

What if you then PTQ the QAT model? Are the usual PTQ precision-loss problems back, or are they worse?

Article author

PTQ on the QAT model would quantize weights that are already INT4, so that should be a no-op? Is my understanding correct?

Does 8-bit precision for activations reduce accuracy for the MoE? What are the computational efficiency gains?
I expressed my question the wrong way previously.

Article author

I believe activations are still FP16/BF16, but there aren't many of them.

All INT4 weights need to stay in RAM, but we only need activations for the activated experts (8 out of 384). Also during the forward pass, activations can be discarded layer after layer (except KV cache).

If we do PTQ after QAT was done in INT4, there won't be any loss, right?

Article author

I think so. Looks like people really care about this. Let me see if I can get some more clues.
