# Granite-4.0-H-Tiny — MLX 3-bit (Apple Silicon)
Maintainer / Publisher: Susant Achary
This repository provides an Apple-Silicon-optimized MLX build of IBM Granite-4.0-H-Tiny with 3-bit weight quantization (plus usage guidance for 2/4/5/6-bit variants if RAM allows).
Granite 4.0 is IBM’s latest hybrid Mamba-2/Transformer family with selective Mixture-of-Experts (MoE), designed for long-context, hyper-efficient inference and enterprise use.
## 🔎 What’s Granite 4.0?
- Architecture. Hybrid Mamba-2 + softmax attention; H variants add MoE routing (sparse activation). Aims to keep expressivity while dramatically reducing memory footprint.
- Efficiency claims. Up to ~70% lower memory and ~2× faster inference vs. comparable models, especially for multi-session and long-context scenarios.
- Context window. 128k tokens (Tiny/Base preview cards).
- Licensing. Apache-2.0 for public/commercial use.

This MLX build targets Granite-4.0-H-Tiny (≈ 7B total, ≈ 1B active parameters). For reference, the family also includes H-Small (≈32B total / 9B active) and Micro/Micro-H (≈3B dense/hybrid) tiers.
## 📦 What’s in this repo (MLX format)

- `config.json` (MLX), `mlx_model*.safetensors` (3-bit shards), tokenizer files, and processor metadata.
- Ready for macOS on M-series chips via Metal/MPS.
The upstream Hugging Face model cards for Granite 4.0 (Tiny/Small) provide additional training details, the staged curricula, and the alignment workflow. Start here for Tiny: ibm-granite/granite-4.0-h-tiny.
## ✅ Intended use
- General instruction-following and chat with long context (128k).
- Enterprise assistant patterns (function calling, structured outputs) and RAG backends that benefit from efficient, large windows (see the structured-output sketch after this list).
- On-device development on Macs (MLX), low-latency local prototyping and evaluation.
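A minimal sketch of the structured-output pattern using the `mlx-lm` Python API (`load`/`generate`); the JSON schema in the prompt is illustrative only, and the repo id is the one shown on this card:

```python
# Sketch: ask the model for JSON, then parse it. Illustrative pattern, not a Granite-specific API.
import json
from mlx_lm import load, generate

# Repo id as listed on this card; point at a local path if you converted the weights yourself.
model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

messages = [
    {"role": "system", "content": "Reply with a single JSON object and nothing else."},
    {"role": "user", "content": 'Extract {"vendor": str, "total": float} from: "ACME Corp, total due $42.50"'},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

raw = generate(model, tokenizer, prompt=prompt, max_tokens=128)
try:
    record = json.loads(raw.strip())
except json.JSONDecodeError:
    record = None  # in a real pipeline, retry or run a repair step here
print(record)
```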
## ⚠️ Limitations
- As a quantized, decoder-only LM, it can produce confident but wrong outputs—review for critical use.
- 2–4-bit quantization may reduce precision on intricate tasks (math/code, tiny-text parsing); prefer higher bit-widths if RAM allows.
- Follow your organization’s safety/PII/guardrail policies (Granite is “open-weight,” not a full product).
## 🧠 Model family at a glance
| Tier | Arch | Params (total / active) | Notes |
|---|---|---|---|
| H-Small | Hybrid + MoE | ~32B / 9B | Workhorse for enterprise agent tasks; strong function-calling & instruction following. |
| H-Tiny (this repo) | Hybrid + MoE | ~7B / 1B | Long-context, efficiency-first; great for local dev. |
| Micro / H-Micro | Dense / Hybrid | ~3B | Edge/low-resource alternatives, or for stacks where the hybrid runtime isn’t yet optimized. |
Context window: up to 128k tokens for Tiny/Base preview lines.
License: Apache-2.0.
## 🧪 Observed on-device behavior (MLX)
Empirically on M-series Macs:
- 3-bit often gives crisp, direct answers with good latency and modest RAM.
- Higher bit-widths (4/5/6-bit) improve faithfulness on fine-grained tasks (tiny OCR, structured parsing), at higher memory cost.
Performance varies by Mac model, prompt/output token lengths, and temperature; validate on your workload.
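One quick way to run that validation is to feed the same prompt to two bit-width variants and compare outputs and latency. A minimal sketch using the `mlx-lm` Python API; the 4-bit repo id below is hypothetical, so substitute whichever variants (or local paths) you actually have:

```python
# Sketch: A/B the same prompt across quantization variants.
import time
from mlx_lm import load, generate

variants = [
    "mlx-community/granite-4.0-h-tiny-3bit-MLX",  # this build
    "mlx-community/granite-4.0-h-tiny-4bit-MLX",  # hypothetical 4-bit sibling; replace with your own path
]
prompt = "List three risks of shipping an unreviewed LLM-generated summary in a legal workflow."

for repo in variants:
    model, tokenizer = load(repo)
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=200)
    elapsed = time.perf_counter() - start
    print(f"--- {repo} ({elapsed:.1f}s) ---\n{text}\n")
```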
## 🔢 Choosing a quantization level (Apple Silicon)
| Variant | Typical peak RAM (7B-class) | Relative speed | Typical behavior | When to choose |
|---|---|---|---|---|
| 2-bit | ~3–4 GB | 🔥🔥🔥🔥 | Smallest footprint; most lossy | Minimal-RAM devices / smoke tests |
| 3-bit (this build) | ~5–6 GB | 🔥🔥🔥🔥 | Direct, concise, great latency | Default for local dev on M1/M2/M3/M4 |
| 4-bit | ~6–7.5 GB | 🔥🔥🔥 | Better detail retention | When you need stronger faithfulness |
| 5-bit | ~8–9 GB | 🔥🔥☆ | Higher fidelity | For heavy docs / structured outputs |
| 6-bit | ~9.5–11 GB | 🔥🔥 | Max quality under MLX quant | If RAM headroom is ample |
Figures are indicative for language-only Tiny (no vision), and will vary with context length and KV cache size.
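If you want one of the other bit-widths, you can quantize the upstream weights yourself. A minimal sketch assuming `mlx-lm`'s `convert` helper (keyword names as in recent `mlx-lm` releases; check your installed version):

```python
# Sketch: build a 4-bit MLX variant from the upstream Hugging Face weights.
# Assumes mlx-lm's convert() helper; keyword names may differ across versions.
from mlx_lm import convert

convert(
    hf_path="ibm-granite/granite-4.0-h-tiny",   # upstream full-precision weights
    mlx_path="granite-4.0-h-tiny-4bit-mlx",     # local output directory (your choice)
    quantize=True,
    q_bits=4,         # pick 2/3/4/5/6 per the table above
    q_group_size=64,  # MLX quantization group size (library default)
)
```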
## 🚀 Quickstart (CLI — MLX)

```bash
# Plain generation (deterministic); MLX targets the Metal GPU automatically, so no device flag is needed.
python -m mlx_lm.generate \
  --model <this-repo-id> \
  --prompt $'Summarize the following notes into 5 bullet points:\n<your text>' \
  --max-tokens 200 \
  --temp 0.0 \
  --seed 0
```
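The same thing from Python, with the tokenizer's chat template applied (a minimal sketch; assumes the `mlx-lm` package and the repo id shown on this card):

```python
# Sketch: chat-style generation with mlx-lm's Python API.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/granite-4.0-h-tiny-3bit-MLX")

messages = [{"role": "user", "content": "Summarize the following notes into 5 bullet points:\n<your text>"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# verbose=True streams the output and prints tokens/sec (and, in recent versions, peak memory).
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
```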
Base model: ibm-granite/granite-4.0-h-tiny (this MLX build: mlx-community/granite-4.0-h-tiny-3bit-MLX)