Performance? Methodology?

Many thanks for sharing your work!!

Can you give any indication of how your quant would compare to:

  • unsloth, e.g. UD_Q4_K_XL
  • the original full-precision weights?

How did you decide which layers to choose for which precision/quantization? Apologies if you described your approach earlier or it is a well-known approach - I am still catching up on quantization strategies :)

I just asked GLM 4.5, and it already provided a (at least for me) good explanation :) It also pointed out potential further areas for improvement (although I don't get how, e.g., using q8 instead of bf16 for shared experts would improve things; isn't q8 lower precision than bfloat16?); just in case you haven't tried these already and/or dismissed them due to limited benefits:

Analysis of GLM 4.5 Air Quantization Strategy

GLM 4.5 Air Model Architecture Overview

Based on the provided configuration and research, GLM 4.5 Air is a Mixture of Experts (MoE) model with these key architectural components:

  1. Base Architecture: 46 transformer layers with 4096 hidden size
  2. MoE Components:
    • 128 routed experts
    • 8 experts selected per token
    • 1 shared expert (used for all tokens)
    • Expert intermediate size: 1408
  3. Attention Mechanism:
    • 96 attention heads
    • 8 key-value heads
    • 128 head dimension
  4. Special Components:
    • Nextn predict layer (1 layer)
    • RMS normalization throughout
    • 131K max position embeddings

Critical Inference Paths in MoE Models

In MoE architectures like GLM 4.5 Air, these paths are particularly critical:

  1. Token Embeddings & Output Heads: Entry and exit points of the model
  2. Attention Mechanisms: Especially key/query/value projections and outputs
  3. MoE Routing Logic: Determines which experts to use
  4. Shared Experts: Processed for every token
  5. Norm Layers: Critical for numerical stability
  6. Early and Late Layers: Often more important than middle layers

Evaluation of the Custom Script

Strengths (Well-Covered Critical Paths)

  1. Norm Layers: All normalization layers preserved at f32 - excellent for stability
  2. Attention Components:
    • Biases at f32 and weights at q8_0 - good balance
    • Attention norms at f32 - important for stability
  3. MoE Routing: ffn_gate_inp.weight at f32 - critical for routing decisions
  4. First and Last Blocks: Special handling with higher precision
  5. Nextn Predict Layer: Components preserved at bf16/f32

Potential Areas for Improvement

  1. MoE Expert Components:
    • Current: ffn_gate_exps.weight and ffn_up_exps.weight at q4_K
    • Suggestion: Consider q5_K or q6_K for these critical expert components
    • Reasoning: These are the actual expert computations; q4_K might be too aggressive
  2. Attention Output:
    • Current: attn_output.weight at q8_0
    • Suggestion: Consider bf16 for better quality
    • Reasoning: Attention output is critical for information flow between layers
  3. Shared Experts:
    • Current: ffn_down_shexp, ffn_gate_shexp, ffn_up_shexp at bf16
    • Suggestion: Consider q8_0 for better quality
    • Reasoning: Shared experts process every token, so higher precision may help
  4. Layer-Specific Precision:
    • Current: Same precision for all layers of the same type
    • Suggestion: Implement a gradient of precision (higher for early/late layers)
    • Reasoning: Research shows early and late layers often contribute more to model quality
  5. Expert Diversity:
    • Current: All experts treated equally
    • Suggestion: Consider identifying and preserving "important" experts at higher precision
    • Reasoning: Not all experts contribute equally; some may be more critical

Recommended Improvements to the Script

# Modified sections with suggested improvements

# MoE expert components - increased precision
blk\.[0-9]\.ffn_gate_exps\.weight=q5_K  # was q4_K
blk\.[0-9]\.ffn_up_exps\.weight=q5_K    # was q4_K
blk\.[1-3][0-9]\.ffn_gate_exps\.weight=q5_K  # was q4_K
blk\.[1-3][0-9]\.ffn_up_exps\.weight=q5_K    # was q4_K
blk\.4[0-6]\.ffn_gate_exps\.weight=q5_K  # was q4_K
blk\.4[0-6]\.ffn_up_exps\.weight=q5_K    # was q4_K

# Attention output - higher precision
blk\.[0-9]\.attn_output\.weight=bf16    # was q8_0
blk\.[1-3][0-9]\.attn_output\.weight=bf16  # was q8_0
blk\.4[0-6]\.attn_output\.weight=bf16    # was q8_0

# Shared experts - higher precision
blk\.[0-9]\.ffn_down_shexp\.weight=q8_0  # was bf16
blk\.[0-9]\.ffn_gate_shexp\.weight=q8_0  # was bf16
blk\.[0-9]\.ffn_up_shexp\.weight=q8_0    # was bf16
blk\.[1-3][0-9]\.ffn_down_shexp\.weight=q8_0  # was bf16
blk\.[1-3][0-9]\.ffn_gate_shexp\.weight=q8_0  # was bf16
blk\.[1-3][0-9]\.ffn_up_shexp\.weight=q8_0    # was bf16
blk\.4[0-6]\.ffn_down_shexp\.weight=q8_0  # was bf16
blk\.4[0-6]\.ffn_gate_shexp\.weight=q8_0  # was bf16
blk\.4[0-6]\.ffn_up_shexp\.weight=q8_0    # was bf16

# Early layers (1-5) - increased precision
blk\.[1-5]\.attn_k\.weight=bf16  # was q8_0
blk\.[1-5]\.attn_q\.weight=bf16  # was q8_0
blk\.[1-5]\.attn_v\.weight=bf16  # was q8_0

# Late layers (42-46) - increased precision
blk\.4[2-6]\.attn_k\.weight=bf16  # was q8_0
blk\.4[2-6]\.attn_q\.weight=bf16  # was q8_0
blk\.4[2-6]\.attn_v\.weight=bf16  # was q8_0

Conclusion

The custom script already covers most critical inference paths well, particularly for norm layers, attention mechanisms, and routing components. The main areas for improvement are:

  1. Increasing precision for MoE expert computations (q4_K β†’ q5_K)
  2. Using bf16 for attention outputs instead of q8_0
  3. Using q8_0 for shared experts instead of bf16
  4. Implementing layer-specific precision with higher precision for early and late layers

These changes would likely improve model quality with a modest increase in model size. The current script is already quite sophisticated and well-designed for the GLM 4.5 Air architecture, but these refinements could help preserve more of the model's original capabilities.

I mean, sure, but that would also increase the size and reduce the performance. Usually I'm looking for a spot where the model performs well on a single GPU system with CPU expert offloading without impacting accuracy too much.

q8_0 is generally good enough for most tensors, so you rarely want to go above that. The exception is norm tensors, which are small and very sensitive to quantization; these are usually kept at f32, even in smaller quants. Putting bf16 on embedding and output tensors may help somewhat, but we're talking very small margins. At the same time, embedding and output tensors are not very large compared to the rest of the model, so bf16 there barely affects size or speed, and then why not. I apply the same logic to other small tensors that don't impact the model size much, just keeping them at bf16.

Attention tensors are sensitive to anything below q8_0. Combined, these tensors must fit on a GPU, since attention is the most computationally expensive part. For 24 GB or 32 GB of VRAM you want attention tensors no lower than q8_0: the quality hit from going lower is usually not worth the memory savings. And if you go with bf16, your attention will take up too much VRAM, leaving little room for the context.
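
To put rough numbers on that, here is a back-of-the-envelope sketch using the dims quoted in the config summary above (46 layers, hidden size 4096, 96 query heads and 8 KV heads of dim 128), ignoring biases, norms and the KV cache:

# Approximate size of the attention weights alone at different precisions.
awk 'BEGIN {
  per_layer = 4096*12288 + 4096*1024 + 4096*1024 + 12288*4096   # q + k + v + output projections
  total = per_layer * 46
  printf "attention params:  %.1fB\n",    total / 1e9
  printf "q8_0 (8.5 bit/w):  %.1f GiB\n", total * 8.5/8 / 2^30
  printf "bf16 (16 bit/w):   %.1f GiB\n", total * 2     / 2^30
}'

So q8_0 keeps the attention weights around 5 GiB across all 46 layers, while bf16 roughly doubles that, and that extra VRAM is better spent on context.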

With CPU expert offloading, the main bottleneck is the RAM bandwidth. At the same time, gate_exps and up_exps are usually less sensitive to quantization, and work well with q4_k, so we apply that quantization. These expert tensors are the bulk of the model, so we reduce the size significantly, and make the model faster by reducing RAM bandwidth requirements. You can also try iq4 variants, but they come with a small performance hit for the extra memory savings.
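
Something like this is a minimal sketch of that offload setup with llama-server (the model path is a placeholder, and I'm assuming a recent llama.cpp or ik_llama.cpp build with the --override-tensor / -ot option, check --help on yours):

# Keep everything on the GPU by default (-ngl 99), but override the routed
# expert tensors to the CPU buffer so they stay in system RAM.
llama-server -m GLM-4.5-Air-custom.gguf \
  -ngl 99 -fa --jinja -c 32768 \
  -ot 'ffn_.*_exps\.=CPU'

With that split, attention, norms and the shared expert stay in VRAM, while the big routed experts stream from RAM, which is where the bandwidth bottleneck above comes from.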

There are also down_exps, which are the same size as up_exps and gate_exps, but generally more sensitive to quantization. A good size for down_exps is usually somewhere around q6_k. However, this doesn't work for GLM-4.5-Air in particular: _k quants pack weights in blocks of 256 along a row, and GLM-4.5-Air's down_exps rows are 1408 wide (the expert intermediate size), which is not a multiple of 256. As a general rule of thumb, larger models are less sensitive to quantization than smaller models. Since GLM-4.5-Air is on the smaller side, we go up in size to q8_0 for down_exps.
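
Putting that together, here is a hedged sketch of roughly this recipe in the same regex=type format as the lines quoted above, assuming ik_llama.cpp's llama-quantize and its --custom-q option (file names are placeholders; if you're on mainline, check llama-quantize --help for the equivalent per-tensor override):

# f32 norms and router, q8_0 attention and down_exps, bf16 embeddings/output
# and shared expert, q4_K for gate/up experts. Tensors not matched here fall
# back to the default type given on the command line.
custom="
token_embd\.weight=bf16
output\.weight=bf16
blk\..*norm\.weight=f32
blk\..*\.ffn_gate_inp\.weight=f32
blk\..*\.attn_q\.weight=q8_0
blk\..*\.attn_k\.weight=q8_0
blk\..*\.attn_v\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0
blk\..*\.ffn_gate_shexp\.weight=bf16
blk\..*\.ffn_up_shexp\.weight=bf16
blk\..*\.ffn_down_shexp\.weight=bf16
blk\..*\.ffn_gate_exps\.weight=q4_K
blk\..*\.ffn_up_exps\.weight=q4_K
blk\..*\.ffn_down_exps\.weight=q8_0
"
# collapse the list into the comma-separated form --custom-q expects
custom=$(echo "$custom" | sed -Ez 's:\n+:,:g; s:^,::; s:,$::')
llama-quantize --custom-q "$custom" GLM-4.5-Air-BF16.gguf GLM-4.5-Air-custom.gguf q8_0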

Finally, after you pick some quantization schemes to explore, you can check how their perplexity differs from the bf16 or q8_0 variant. Lower perplexity is better. This is where things get interesting, as you can start tweaking individual tensors by assigning them higher or lower quality quantizations. This is very helpful for squeezing more quality out of smaller quants. However, there is a benchmaxxing trap here: if you start optimizing for perplexity, you are really optimizing for your test dataset. There are no guarantees that the quantized model will perform equally well on inputs that are not covered by your test dataset.
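
A minimal sketch of that check with llama-perplexity, assuming the usual wikitext-2 wiki.test.raw test file (paths are placeholders; the -ot override mirrors the offload setup above):

# Run the same test text through the q8_0 baseline and the candidate quant;
# lower final PPL is better, and what matters is the delta between the two.
llama-perplexity -m GLM-4.5-Air-q8_0.gguf   -f wiki.test.raw -ngl 99 -ot 'ffn_.*_exps\.=CPU'
llama-perplexity -m GLM-4.5-Air-custom.gguf -f wiki.test.raw -ngl 99 -ot 'ffn_.*_exps\.=CPU'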

If you see multiple quantizations offered in the same repo, it's very helpful to see perplexity as well as speed tradeoffs to help you pick the right one. @ubergarm offers excellent quants along with perplexity measurements and guides to go along with them.

And to make it more confusing, you can check out the stuff by @Thireus, which really takes this all to its logical extreme: testing most tensors at most quantization types, allowing you to mix-and-match a franken-quant tailored to your specific RAM/VRAM configuration. Thireus tries to make my quants look bad by comparing theirs to mine in their charts; I wish they would instead compare against unsloth so we both look good :p lmao jk jk I really appreciate the variety of quant offerings hahah 😹

For example, see some more of the charts I'm teasing about here: https://github.com/Thireus/GGUF-Tool-Suite/tree/main/ppl_graphs

Anyway, yeah, the answer to all the interesting questions, taking all the nuance into account, is as @anikifoss discusses above.

So grab a few quants that fit your hardware configuration and test them all with llama-perplexity and llama-sweep-bench, between ik_llama.cpp and mainline llama.cpp, to see what works best for your specific application.
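
For the speed side, something like this (paths are placeholders; I'm assuming ik_llama.cpp's llama-sweep-bench takes the usual common flags, check --help on your build):

# Prompt-processing and token-generation speed across context depths;
# run once per candidate quant and compare.
llama-sweep-bench -m GLM-4.5-Air-custom.gguf -c 32768 -ngl 99 -fa \
  -ot 'ffn_.*_exps\.=CPU'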

Your comments are both insightful and very much appreciated! Thanks!

I am most interested in coding/software engineering tasks, and GLM 4.5 Air is the first model that I can run quantized locally (on 5x RTX 3090 with decent context) and feel like it often has something valuable to add. More tokens/sec is of course also attractive, and seemingly even ud-q4_k_xl performed very close to the "official" full/half-precision inference API endpoints (z.ai, bigmodel.cn). However, the failure rate for the same task is definitely much higher, although repeating or slightly restating the problem often enough usually leads to a good result.

One other issue, when thinking is enabled, is that a GLM 4.5 Air quant sometimes cannot settle for a long time and jumps from one "train of thought" to the next. E.g., just asking "How to benchmark perplexity with llama.cpp?" shows, for all quants I have tried so far (q6_k, hq4_k, ud-q5_k_xl, ud-q4_k_xl) on a recent llama.cpp build (-ngl 99 --ctx-size 65536 --temp 0.6 --top-p 0.95 -fa --jinja), the same long rumination that doesn't show up on z.ai/bigmodel.cn. (Not sure whether this is a quant problem, a potential problem with llama.cpp, or even my hardware.)

benchmaxxing trap

Good point, and it goes beyond perplexity benchmarking: that's also why I looked into hq4_k. As far as I understand, you are not using any calibration data for the quantization (the way unsloth does), hence the model quality should be preserved equally well across all domains.

So grab a few quants that fit into your hardware configuration and test them all

Yeah, will do. At least it's good to know that there is nothing better than getting a first idea from a perplexity benchmark and otherwise just trying things out on the prompts I care about ;)

One other issue, when thinking is enabled, is that a GLM 4.5 Air quant sometimes cannot settle for a long time and jumps from one "train of thought" to the next. (...)

Anecdotally, I find models will get stuck in a loop when attention is quantized too low. That's one of the reasons why I avoid quantizing attention below q8_0.

Also, I use very conservative settings for coding: --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0. This disables top-p entirely; see this article for more details. And if a model needs a sampler penalty to avoid getting stuck in a loop, it's probably not good enough for coding, so I disable repeat-penalty as well.
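
For example, a hypothetical llama-server invocation with those settings (model path and port are placeholders):

# --top-k 0 and --top-p 1.0 disable those samplers, --repeat-penalty 1.0
# disables the penalty, leaving --min-p 0.1 as the only filter.
llama-server -m GLM-4.5-Air-custom.gguf -ngl 99 -fa --jinja -c 65536 \
  --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
  --port 8080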
