Granite-4.0-H-Tiny — MLX 3-bit (Apple Silicon)

Maintainer / Publisher: Susant Achary

This repository provides an Apple-Silicon-optimized MLX build of IBM Granite-4.0-H-Tiny with 3-bit weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).
Granite 4.0 is IBM’s latest hybrid Mamba-2/Transformer family with selective Mixture-of-Experts (MoE), designed for long-context, hyper-efficient inference and enterprise use.


🔎 What’s Granite 4.0?

  • Architecture. Hybrid Mamba-2 + softmax attention; H variants add MoE routing (sparse activation). Aims to keep expressivity while dramatically reducing memory footprint.
  • Efficiency claims. Up to ~70% lower memory and ~2× faster inference vs. comparable models, especially for multi-session and long-context scenarios.
  • Context window. 128k tokens (Tiny/Base preview cards).
  • Licensing. Apache-2.0 for public/commercial use.

This MLX build targets Granite-4.0-H-Tiny (≈ 7B total, ≈ 1B active parameters). For reference, the family also includes H-Small (≈32B total / 9B active) and Micro/Micro-H (≈3B dense/hybrid) tiers.


📦 What’s in this repo (MLX format)

  • config.json (MLX), mlx_model*.safetensors (3-bit shards), tokenizer files, and processor metadata.
  • Ready for macOS on M-series chips via MLX’s Metal backend.

The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional detail on training, the staged curricula, and the alignment workflow. Start here for Tiny: ibm-granite/granite-4.0-h-tiny.
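
To sanity-check the quantization settings without downloading the weight shards, here is a minimal sketch using the `huggingface_hub` client; the repo ID is an assumption — substitute this repository’s ID if it differs.

```python
# Minimal sketch: fetch only config.json and inspect the MLX quantization
# settings (bit-width, group size) before pulling the weight shards.
# The repo ID is an assumption -- substitute this repository's ID.
import json

from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="mlx-community/granite-4.0-h-tiny-3bit-MLX",
    filename="config.json",
)

with open(config_path) as f:
    config = json.load(f)

print(config.get("model_type"))    # architecture tag recorded by the exporter
print(config.get("quantization"))  # MLX quant exports typically record bits/group size here
```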


✅ Intended use

  • General instruction-following and chat with long context (128k).
  • Enterprise assistant patterns (function calling, structured outputs) and RAG backends that benefit from efficient, large context windows (see the chat-style sketch after this list).
  • On-device development on Macs (MLX), low-latency local prototyping and evaluation.
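
For the assistant and RAG patterns above, a chat-style sketch using mlx-lm’s Python helpers and the tokenizer’s chat template is below; the repo ID, system prompt, and placeholder context are illustrative assumptions.

```python
# Chat-style generation sketch (mlx-lm Python API). Repo ID and message
# contents are illustrative; swap in this repository's ID and your own context.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

messages = [
    {"role": "system", "content": "You are a concise enterprise assistant."},
    {"role": "user", "content": "Summarize the retrieved passages below in 5 bullet points:\n<retrieved context>"},
]

# Render the conversation with the model's own chat template before generating.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(text)
```

Structured-output and function-calling prompts follow the same pattern; only the message content changes.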

⚠️ Limitations

  • As a quantized, decoder-only LM, it can produce confident but wrong outputs—review for critical use.
  • 2–4-bit quantization may reduce precision on intricate tasks (math/code, fine-grained structured extraction); prefer higher bit-widths if RAM allows.
  • Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).

🧠 Model family at a glance

| Tier | Architecture | Params (total / active) | Notes |
|---|---|---|---|
| H-Small | Hybrid + MoE | ~32B / 9B | Workhorse for enterprise agent tasks; strong function calling and instruction following. |
| H-Tiny (this repo) | Hybrid + MoE | ~7B / 1B | Long-context, efficiency-first; great for local dev. |
| Micro / H-Micro | Dense / Hybrid | ~3B | Edge/low-resource alternatives for when the hybrid runtime isn’t optimized. |

Context Window: up to 128k tokens for Tiny/Base preview lines.
License: Apache-2.0.


🧪 Observed on-device behavior (MLX)

Empirically on M-series Macs:

  • 3-bit often gives crisp, direct answers with good latency and modest RAM.
  • Higher bit-widths (4/5/6-bit) improve faithfulness on fine-grained tasks (e.g., precise structured parsing and extraction), at higher memory cost.

Performance varies by Mac model, prompt/output token lengths, and sampling settings; validate on your workload.
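
If you want a quick number for your own workload, a crude throughput sketch (wall-clock timing, not a rigorous benchmark; repo ID assumed) looks like this:

```python
# Crude throughput check: time one generation and estimate tokens/second.
# Run it a few times and discard the first (warm-up) pass. Repo ID is assumed.
import time

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

prompt = "List five considerations when deploying an on-device LLM."

start = time.perf_counter()
completion = generate(model, tokenizer, prompt=prompt, max_tokens=200)
elapsed = time.perf_counter() - start

n_tokens = len(tokenizer.encode(completion))
print(f"{n_tokens} tokens in {elapsed:.2f}s "
      f"(~{n_tokens / elapsed:.1f} tok/s, prompt + decode combined)")
```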


🔢 Choosing a quantization level (Apple Silicon)

| Variant | Typical peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---|---|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| 3-bit (this build) | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise, great latency | Default for local dev on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| 5-bit | ~8–9 GB | 🔥🔥☆ | Higher fidelity | Heavy docs / structured outputs |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |

Figures are indicative for language-only Tiny (no vision), and will vary with context length and KV cache size.
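
If you need a bit-width other than the published ones, recent mlx-lm releases can re-quantize the upstream weights locally. The sketch below uses mlx-lm’s `convert` helper; keyword names and supported bit-widths can vary by mlx-lm/MLX version, and the output path is illustrative — check `python -m mlx_lm.convert --help` against your install.

```python
# Sketch: build your own higher-bit MLX quantization from the upstream
# Granite weights. Keyword names follow mlx-lm's convert helper; verify
# against your installed version. Output path is illustrative.
from mlx_lm import convert

convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",  # upstream Hugging Face weights
    mlx_path="granite-4.0-h-tiny-6bit-mlx",    # local output directory
    quantize=True,
    q_bits=6,         # pick a bit-width from the table above, if supported
    q_group_size=64,  # default grouping; affects fidelity vs. size
)
```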


🚀 Quickstart (CLI — MLX)

```bash
# Plain generation (deterministic)
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt "Summarize the following notes into 5 bullet points:\n<your text>" \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
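
The same model can be driven from Python. For interactive use, here is a streaming sketch, assuming a recent mlx-lm where `stream_generate` yields chunks with a `.text` field (older releases yielded plain strings); the repo ID is an assumption.

```python
# Streaming generation sketch for interactive/chat use (mlx-lm Python API).
# Assumes a recent mlx-lm where stream_generate yields chunks with a .text
# attribute; the repo ID is an assumption -- substitute this repository's ID.
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

prompt = "Summarize the following notes into 5 bullet points:\n<your text>"

for chunk in stream_generate(model, tokenizer, prompt, max_tokens=200):
    print(chunk.text, end="", flush=True)
print()
```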