Granite-4.0-H-Tiny — MLX 3-bit (Apple Silicon)

Maintainer / Publisher: Susant Achary

This repository provides an Apple-Silicon-optimized MLX build of IBM Granite-4.0-H-Tiny with 3-bit weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).
Granite 4.0 is IBM’s latest hybrid Mamba-2/Transformer family with selective Mixture-of-Experts (MoE), designed for long-context, hyper-efficient inference and enterprise use.


🔎 What’s Granite 4.0?

  • Architecture. Hybrid Mamba-2 + softmax attention; H variants add MoE routing (sparse activation). Aims to keep expressivity while dramatically reducing memory footprint.
  • Efficiency claims. Up to ~70% lower memory and ~2× faster inference vs. comparable models, especially for multi-session and long-context scenarios.
  • Context window. 128k tokens (Tiny/Base preview cards).
  • Licensing. Apache-2.0 for public/commercial use.

This MLX build targets Granite-4.0-H-Tiny (≈ 7B total, ≈ 1B active parameters). For reference, the family also includes H-Small (≈32B total / 9B active) and Micro/Micro-H (≈3B dense/hybrid) tiers.


📦 What’s in this repo (MLX format)

  • config.json (MLX), mlx_model*.safetensors (3-bit shards), tokenizer files, and processor metadata.
  • Ready for macOS on M-series chips via MLX’s Metal backend.

The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional detail on training, the staged curricula, and the alignment workflow. Start here for Tiny: ibm-granite/granite-4.0-h-tiny.
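
To sanity-check the quantization settings without downloading the weight shards, here is a minimal sketch using the `huggingface_hub` client; the repo ID is an assumption — substitute this repository’s ID if it differs.

```python
# Minimal sketch: fetch only config.json and inspect the MLX quantization
# settings (bit-width, group size) before pulling the weight shards.
# The repo ID is an assumption -- substitute this repository's ID.
import json

from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="mlx-community/granite-4.0-h-tiny-3bit-MLX",
    filename="config.json",
)

with open(config_path) as f:
    config = json.load(f)

print(config.get("model_type"))    # architecture tag recorded by the exporter
print(config.get("quantization"))  # MLX quant exports typically record bits/group size here
```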


✅ Intended use

  • General instruction-following and chat with long context (128k).
  • Enterprise assistant patterns (function calling, structured outputs) and RAG backends that benefit from efficient, large context windows (see the chat-style sketch after this list).
  • On-device development on Macs (MLX), low-latency local prototyping and evaluation.
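
For the assistant and RAG patterns above, a chat-style sketch using mlx-lm’s Python helpers and the tokenizer’s chat template is below; the repo ID, system prompt, and placeholder context are illustrative assumptions.

```python
# Chat-style generation sketch (mlx-lm Python API). Repo ID and message
# contents are illustrative; swap in this repository's ID and your own context.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

messages = [
    {"role": "system", "content": "You are a concise enterprise assistant."},
    {"role": "user", "content": "Summarize the retrieved passages below in 5 bullet points:\n<retrieved context>"},
]

# Render the conversation with the model's own chat template before generating.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```

Structured-output and function-calling prompts follow the same pattern; only the message content changes.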

⚠️ Limitations

  • As a quantized, decoder-only LM, it can produce confident but wrong outputs—review for critical use.
  • 2–4-bit quantization may reduce precision on intricate tasks (math/code, fine-grained structured extraction); prefer higher bit-widths if RAM allows.
  • Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).

🧠 Model family at a glance

| Tier | Architecture | Params (total / active) | Notes |
|---|---|---|---|
| H-Small | Hybrid + MoE | ~32B / 9B | Workhorse for enterprise agent tasks; strong function calling and instruction following. |
| H-Tiny (this repo) | Hybrid + MoE | ~7B / 1B | Long-context, efficiency-first; great for local dev. |
| Micro / H-Micro | Dense / Hybrid | ~3B | Edge/low-resource alternatives for when the hybrid runtime isn’t optimized. |

Context Window: up to 128k tokens for Tiny/Base preview lines.
License: Apache-2.0.


🧪 Observed on-device behavior (MLX)

Empirically on M-series Macs:

  • 3-bit often gives crisp, direct answers with good latency and modest RAM.
  • Higher bit-widths (4/5/6-bit) improve faithfulness on fine-grained tasks (e.g., precise structured parsing and extraction), at higher memory cost.

Performance varies by Mac model, prompt/output token lengths, and sampling settings; validate on your workload.
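
If you want a quick number for your own workload, a crude throughput sketch (wall-clock timing, not a rigorous benchmark; repo ID assumed) looks like this:

```python
# Crude throughput check: time one generation and estimate tokens/second.
# Run it a few times and discard the first (warm-up) pass. Repo ID is assumed.
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

prompt = "List five considerations when deploying an on-device LLM."

start = time.perf_counter()
completion = generate(model, tokenizer, prompt=prompt, max_tokens=200)
elapsed = time.perf_counter() - start

n_tokens = len(tokenizer.encode(completion))
print(f"{n_tokens} tokens in {elapsed:.2f}s "
      f"(~{n_tokens / elapsed:.1f} tok/s, prompt + decode combined)")
```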


🔢 Choosing a quantization level (Apple Silicon)

| Variant | Typical peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---|---|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| 3-bit (this build) | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise, great latency | Default for local dev on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| 5-bit | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavy docs / structured outputs |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |

Figures are indicative for language-only Tiny (no vision), and will vary with context length and KV cache size.
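
If you need a bit-width other than the published ones, recent mlx-lm releases can re-quantize the upstream weights locally. The sketch below uses mlx-lm’s `convert` helper; keyword names and supported bit-widths can vary by mlx-lm/MLX version, and the output path is illustrative — check `python -m mlx_lm.convert --help` against your install.

```python
# Sketch: build your own higher-bit MLX quantization from the upstream
# Granite weights. Keyword names follow mlx-lm's convert helper; verify
# against your installed version. Output path is illustrative.
from mlx_lm import convert

convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",  # upstream Hugging Face weights
    mlx_path="granite-4.0-h-tiny-6bit-mlx",    # local output directory
    quantize=True,
    q_bits=6,         # pick a bit-width from the table above, if supported
    q_group_size=64,  # default grouping; affects fidelity vs. size
)
```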


🚀 Quickstart (CLI — MLX)

```bash
# Plain generation (deterministic)
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following notes into 5 bullet points:\n<your text>" \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
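
The same model can be driven from Python. For interactive use, here is a streaming sketch, assuming a recent mlx-lm where `stream_generate` yields chunks with a `.text` field (older releases yielded plain strings); the repo ID is an assumption.

```python
# Streaming generation sketch for interactive/chat use (mlx-lm Python API).
# Assumes a recent mlx-lm where stream_generate yields chunks with a .text
# attribute; the repo ID is an assumption -- substitute this repository's ID.
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

prompt = "Summarize the following notes into 5 bullet points:\n<your text>"

for chunk in stream_generate(model, tokenizer, prompt, max_tokens=200):
    print(chunk.text, end="", flush=True)
print()
```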