# Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32
Author: B
Toolkit: `mlx-lm` 0.26.4
Target: On-device inference on Apple Silicon (MLX) with quality-first 4-bit quantization.
## TL;DR
- Weights: 4-bit, group size 32 (GS32)
- Embeddings only: 8-bit (GS32) for input fidelity
- Activations/KV hint: `bfloat16` (per config)
- Why: GS32 reduces quantization error vs GS64; 8-bit embeddings preserve lexical nuance and long-context token identity.
- Trade-off: Slightly more memory and a little slower than plain 4-bit GS64, but steadier instruction-following and fewer “wobble” responses.
## What’s special here
Quantization spec:
- `bits: 4`, `group_size: 32` for all transformer weights
- `model.embed_tokens`: bits 8, group size 32 (embeddings in 8-bit)
- Config fields are present in both `quantization` and `quantization_config` for HF compatibility.
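A minimal sketch of how to confirm that layout from the shipped `config.json`, assuming `huggingface_hub` is installed; the exact nesting of the per-layer override for `model.embed_tokens` may differ between mlx-lm releases.

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download (or reuse the cached copy of) the repo's config.json and print the quant fields.
repo = "brabooObrabo/Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32"
path = Path(snapshot_download(repo, allow_patterns=["config.json"]))
cfg = json.loads((path / "config.json").read_text())

# Both keys should be present; expected values: bits=4, group_size=32,
# with an 8-bit entry for model.embed_tokens (exact layout is an assumption).
print(json.dumps(cfg.get("quantization"), indent=2))
print(json.dumps(cfg.get("quantization_config"), indent=2))
```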
## Rationale
- GS32 vs GS64: Smaller groups mean finer scaling → lower quantization error, especially around attention/MLP outliers.
- 8-bit embeddings: The embedding table dominates early information flow. Keeping it at 8-bit reduces input aliasing, helping with nuanced prompts and longer context stability.
- Back-of-envelope memory impact (reproduced in the sketch after this list):
  - Vocab 151,936 × dim 2,560 → ~388,956,160 params.
  - 8-bit embed ≈ 0.362 GiB, 4-bit embed ≈ 0.181 GiB → ~0.18 GiB increase.
  - Net: still comfortably “lightweight,” just not starved.
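A rough sketch of both points, assuming `mlx` is installed (`pip install mlx`): it redoes the embedding-size arithmetic and compares 4-bit reconstruction error at GS64 vs GS32 on a random matrix. This only illustrates the scaling argument; it is not a benchmark of this model’s actual weights.

```python
import mlx.core as mx

# 1) Embedding-table size at 8-bit vs 4-bit (weights only; per-group scales/biases
#    add a small overhead that is ignored here).
vocab, dim = 151_936, 2_560
params = vocab * dim                                  # 388,956,160
gib = 1024 ** 3
print(f"8-bit embed ≈ {params * 1.0 / gib:.3f} GiB")  # ~0.362
print(f"4-bit embed ≈ {params * 0.5 / gib:.3f} GiB")  # ~0.181

# 2) Finer groups → lower 4-bit reconstruction error (random weights, illustrative only).
w = mx.random.normal((2048, 2048))
for gs in (64, 32):
    wq, scales, biases = mx.quantize(w, group_size=gs, bits=4)
    w_hat = mx.dequantize(wq, scales, biases, group_size=gs, bits=4)
    err = mx.abs(w - w_hat).mean().item()
    print(f"group_size={gs}: mean abs error ≈ {err:.5f}")
```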
## Who should use this
- On-device chat where consistency matters more than raw token/sec.
- Tool-use, code hints, or mathy prompts that get flaky under aggressive quantization.
- Mac MLX users who want a smart 4-bit profile without going full 8-bit.
## Install & basic use (MLX)
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("brabooObrabo/Qwen3-4B-Instruct-2507-MLX-4bit-GS32-embed-8bit-GS32")

prompt = "hello"
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings via a sampler object
# rather than temperature/top_p keyword arguments on generate().
sampler = make_sampler(temp=0.7, top_p=0.9)
out = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler, verbose=True)
print(out)
```
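For interactive use, a streamed variant of the same call, assuming the `stream_generate` helper shipped with recent mlx-lm releases (the response object’s `.text` field is how current versions expose incremental output; older versions yielded plain strings):

```python
from mlx_lm import stream_generate

# Print tokens as they are produced instead of waiting for the full completion.
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
```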
## Suggested generation defaults
- temperature: 0.6–0.8
- top_p: 0.9
- top_k: 40–60
- repeat_penalty: 1.05–1.10
Tune as usual; GS32 + 8-bit embeddings tends to accept slightly lower temps without sounding robotic.
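A sketch of plugging these defaults into the call from the example above, assuming `make_sampler` and `make_logits_processors` from `mlx_lm.sample_utils` (argument names as in recent releases; verify against your installed version):

```python
from mlx_lm.sample_utils import make_logits_processors, make_sampler

# Suggested defaults: mid temperature, nucleus + top-k sampling, light repetition penalty.
sampler = make_sampler(temp=0.7, top_p=0.9, top_k=50)
logits_processors = make_logits_processors(repetition_penalty=1.05)

out = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=sampler,
    logits_processors=logits_processors,
    verbose=True,
)
```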
## Practical notes
- KV cache: By default, activations/KV use bf16 (per config hint). For very long contexts, watch memory and consider runtime KV-cache strategies (see the sketch after this list).
- Context length: Respect the base model’s practical limits; rope params in config don’t magically grant 260k tokens.
- Speed: Expect ~5–15% slower decode vs GS64 all-4bit on the same hardware, with fewer oddities in multi-step reasoning.
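One such strategy, sketched under the assumption that `make_prompt_cache` and the `prompt_cache` argument behave as in recent mlx-lm releases; capping `max_kv_size` trades exact long-range attention for bounded memory.

```python
from mlx_lm.models.cache import make_prompt_cache

# Cap KV-cache growth so memory stays bounded on long conversations;
# entries beyond the cap are rotated out (an approximation, not full attention).
prompt_cache = make_prompt_cache(model, max_kv_size=4096)
out = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    prompt_cache=prompt_cache,
    verbose=True,
)
```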