# Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32
Author: B
Toolkit: `mlx-lm` 0.26.4
Target: On-device inference on Apple Silicon (MLX) with quality-first 4-bit quantization.
## TL;DR
- Weights: 4-bit, group size 32 (GS32)
- Embeddings only: 8-bit (GS32) for input fidelity
- Activations/KV hint: `bfloat16` (per config)
- Why: GS32 reduces quantization error vs GS64; 8-bit embeddings preserve lexical nuance and long-context token identity.
- Trade-off: Slightly more memory and a little slower than plain 4-bit GS64, but steadier instruction-following and fewer “wobble” responses.
## What’s special here
Quantization spec:
- `bits: 4`, `group_size: 32` for all transformer weights
- `model.embed_tokens`: bits 8, group size 32 (embeddings in 8-bit)
- Config fields are present in both `quantization` and `quantization_config` for HF compatibility.
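A minimal sketch of how to confirm that layout from the shipped `config.json`, assuming `huggingface_hub` is installed; the exact nesting of the per-layer override for `model.embed_tokens` may differ between mlx-lm releases.

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download (or reuse the cached copy of) the repo's config.json and print the quant fields.
repo = "brabooObrabo/Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32"
path = Path(snapshot_download(repo, allow_patterns=["config.json"]))
cfg = json.loads((path / "config.json").read_text())

# Both keys should be present; expected values: bits=4, group_size=32,
# with an 8-bit entry for model.embed_tokens (exact layout is an assumption).
print(json.dumps(cfg.get("quantization"), indent=2))
print(json.dumps(cfg.get("quantization_config"), indent=2))
```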
## Rationale
- GS32 vs GS64: Smaller groups mean finer scaling → lower quantization error, especially around attention/MLP outliers.
- 8-bit embeddings: The embedding table dominates early information flow. Keeping it at 8-bit reduces input aliasing, helping with nuanced prompts and longer context stability.
- Back-of-envelope memory impact (reproduced in the sketch after this list):
  - Vocab 151,936 × dim 2,560 → ~388,956,160 params.
  - 8-bit embed ≈ 0.362 GiB, 4-bit embed ≈ 0.181 GiB → ~0.18 GiB increase.
  - Net: still comfortably “lightweight,” just not starved.
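A rough sketch of both points, assuming `mlx` is installed (`pip install mlx`): it redoes the embedding-size arithmetic and compares 4-bit reconstruction error at GS64 vs GS32 on a random matrix. This only illustrates the scaling argument; it is not a benchmark of this model’s actual weights.

```python
import mlx.core as mx

# 1) Embedding-table size at 8-bit vs 4-bit (weights only; per-group scales/biases
#    add a small overhead that is ignored here).
vocab, dim = 151_936, 2_560
params = vocab * dim                                  # 388,956,160
gib = 1024 ** 3
print(f"8-bit embed ≈ {params * 1.0 / gib:.3f} GiB")  # ~0.362
print(f"4-bit embed ≈ {params * 0.5 / gib:.3f} GiB")  # ~0.181

# 2) Finer groups → lower 4-bit reconstruction error (random weights, illustrative only).
w = mx.random.normal((2048, 2048))
for gs in (64, 32):
    wq, scales, biases = mx.quantize(w, group_size=gs, bits=4)
    w_hat = mx.dequantize(wq, scales, biases, group_size=gs, bits=4)
    err = mx.abs(w - w_hat).mean().item()
    print(f"group_size={gs}: mean abs error ≈ {err:.5f}")
```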
## Who should use this
- On-device chat where consistency matters more than raw token/sec.
- Tool-use, code hints, or mathy prompts that get flaky under aggressive quantization.
- Mac MLX users who want a smart 4-bit profile without going full 8-bit.
## Install & basic use (MLX)
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("brabooObrabo/Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32")

prompt = "hello"
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings via a sampler object
# rather than temperature/top_p keyword arguments on generate().
sampler = make_sampler(temp=0.7, top_p=0.9)
out = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler, verbose=True)
print(out)
```
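For interactive use, a streamed variant of the same call, assuming the `stream_generate` helper shipped with recent mlx-lm releases (the response object’s `.text` field is how current versions expose incremental output; older versions yielded plain strings):

```python
from mlx_lm import stream_generate

# Print tokens as they are produced instead of waiting for the full completion.
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
```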
## Suggested generation defaults
- temperature: 0.6–0.8
- top_p: 0.9
- top_k: 40–60
- repeat_penalty: 1.05–1.10
Tune as usual; GS32 + 8-bit embeddings tends to accept slightly lower temps without sounding robotic.
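A sketch of plugging these defaults into the call from the example above, assuming `make_sampler` and `make_logits_processors` from `mlx_lm.sample_utils` (argument names as in recent releases; verify against your installed version):

```python
from mlx_lm.sample_utils import make_logits_processors, make_sampler

# Suggested defaults: mid temperature, nucleus + top-k sampling, light repetition penalty.
sampler = make_sampler(temp=0.7, top_p=0.9, top_k=50)
logits_processors = make_logits_processors(repetition_penalty=1.05)

out = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=sampler,
    logits_processors=logits_processors,
    verbose=True,
)
```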
## Practical notes
- KV cache: By default, activations/KV use bf16 (per config hint). For very long contexts, watch memory and consider runtime KV-cache strategies (see the sketch after this list).
- Context length: Respect the base model’s practical limits; rope params in config don’t magically grant 260k tokens.
- Speed: Expect ~5–15% slower decode vs GS64 all-4bit on the same hardware, with fewer oddities in multi-step reasoning.
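One such strategy, sketched under the assumption that `make_prompt_cache` and the `prompt_cache` argument behave as in recent mlx-lm releases; capping `max_kv_size` trades exact long-range attention for bounded memory.

```python
from mlx_lm.models.cache import make_prompt_cache

# Cap KV-cache growth so memory stays bounded on long conversations;
# entries beyond the cap are rotated out (an approximation, not full attention).
prompt_cache = make_prompt_cache(model, max_kv_size=4096)
out = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    prompt_cache=prompt_cache,
    verbose=True,
)
```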