Apriel-1.5-15B-Thinker – MLX 3-bit (Apple Silicon)

Format: MLX (Mac, Apple Silicon)
Quantization: 3-bit (balanced footprint ↔ quality)
Base: ServiceNow-AI/Apriel-1.5-15B-Thinker
Architecture: Pixtral-style LLaVA (vision encoder → 2-layer projector → decoder)

This repository provides a 3-bit MLX build of Apriel-1.5-15B-Thinker for on-device multimodal inference on Apple Silicon. In side-by-side tests, the 3-bit variant often:

  • uses significantly less RAM than 6-bit,
  • decodes faster, and
  • tends to produce more direct answers (less "thinking out loud") at low temperature.

If RAM allows, we also suggest trying 4-bit/5-bit/6-bit variants (guidance below) for tasks that demand more fidelity.

Explore other Apriel MLX variants under the mlx-community namespace on the Hub.


🔎 Upstream → MLX summary

Apriel-1.5-15B-Thinker is a multimodal reasoning VLM built via depth upscaling, two-stage multimodal continual pretraining, and SFT with explicit reasoning traces (math, coding, science, tool-use).
This MLX release converts the upstream checkpoint with 3-bit quantization for smaller memory and quick startup on macOS.
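
For reference, a quantized export like this one can usually be reproduced with the mlx_vlm converter. The flags below (--hf-path, --mlx-path, -q, --q-bits) reflect recent mlx_vlm releases and are an assumption; check python -m mlx_vlm.convert --help for your installed version.

python -m mlx_vlm.convert \
  --hf-path ServiceNow-AI/Apriel-1.5-15B-Thinker \
  --mlx-path ./Apriel-1.5-15B-Thinker-3bit-MLX \
  -q --q-bits 3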


📦 Contents

  • config.json (MLX config for Pixtral-style VLM)
  • mlx_model*.safetensors (3-bit shards)
  • tokenizer.json, tokenizer_config.json
  • processor_config.json / image_processor.json
  • model_index.json and metadata

🚀 Quickstart (CLI)
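
These commands assume the mlx-vlm package is installed from PyPI (a recent release is assumed):

pip install -U mlx-vlm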

Single image caption

python -m mlx_vlm.generate \
  --model <this-repo-id> \
  --image /path/to/image.jpg \
  --prompt "Describe this image in two concise sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0

🔀 Model Family Comparison (2-bit → 6-bit)

TL;DR: Start with 3-bit for the best size↔quality trade-off. If you need finer OCR/diagram detail and have RAM, step up to 4-bit/5-bit. Use 6-bit only when you have the headroom and explicitly ask for concise answers.

📊 Quick Comparison

| Variant | 🧠 Peak RAM* | ⚡ Speed (rel.) | 🗣️ Output Style (typical) | ✅ Best For | ⚠️ Watch Out For |
|---|---|---|---|---|---|
| 2-bit | ~7–8 GB | 🔥🔥🔥🔥 | Shortest, most lossy | Minimal RAM demos, quick triage | Detail loss on OCR/dense charts; more omissions |
| 3-bit | ~9–10 GB | 🔥🔥🔥🔥 | Direct, concise | Default on M1/M2/M3; day-to-day use | May miss tiny text; keep prompts precise |
| 4-bit | ~11–12.5 GB | 🔥🔥🔥 | More detail retained | Docs/UIs with small text; charts | Slightly slower; still quantization artifacts |
| 5-bit | ~13–14 GB | 🔥🔥☆ | Higher fidelity | Heavier document/diagram tasks | Needs more RAM; occasional verbose answers |
| 6-bit | ~14.5–16 GB | 🔥🔥 | Highest MLX fidelity | Max quality under quantization | Can "think aloud"; add a "be concise" instruction |

*Indicative for a ~15B VLM under MLX; exact numbers vary with device, image size, and context length.


🧪 Example (COCO 000000039769.jpg, "two cats on a pink couch")

| Variant | ⏱️ Prompt TPS | ⏱️ Gen TPS | 📈 Peak RAM | 📝 Notes |
|---|---|---|---|---|
| 3-bit | ~79 tok/s | ~9.79 tok/s | ~9.57 GB | Direct answer; minimal "reasoning" leakage |
| 6-bit | ~78 tok/s | ~6.50 tok/s | ~14.81 GB | Sometimes prints "Here are my reasoning steps…" |

Settings: --temperature 0.0 --max-tokens 100 --device mps. Results vary by Mac model and image resolution; trend is consistent.
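
To reproduce this comparison on your own Mac, point the same generate command at the COCO image (the cocodataset.org URL is assumed to be reachable and accepted by --image; download the file first if your mlx_vlm build only takes local paths):

python -m mlx_vlm.generate \
  --model <this-repo-id> \
  --image http://images.cocodataset.org/val2017/000000039769.jpg \
  --prompt "Describe this image." \
  --max-tokens 100 --temperature 0.0 --device mps --seed 0

Repeat with the 6-bit repo ID to see the RAM and generation-speed gap on your machine.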


🧭 Choosing the Right Precision

  • I just want it to work on my Mac: 👉 3-bit
  • Tiny fonts / invoices / UI text matter: 👉 4-bit, then 5-bit if RAM allows
  • I need every drop of quality and have ≥16 GB free: 👉 6-bit (add "Answer directly; do not include reasoning.")
  • I have very little RAM: 👉 2-bit (expect noticeable quality loss)

βš™οΈ Suggested Settings (per variant)

| Variant | Max Tokens | Temp | Seed | Notes |
|---|---|---|---|---|
| 2-bit | 64–96 | 0.0 | 0 | Keep short; single image; expect omissions |
| 3-bit | 96–128 | 0.0 | 0 | Great default; concise prompts help |
| 4-bit | 128–192 | 0.0–0.2 | 0 | Better small-text recall; watch RAM |
| 5-bit | 128–256 | 0.0–0.2 | 0 | Best OCR recall below 6-bit |
| 6-bit | 128–256 | 0.0 | 0 | Add anti-CoT phrasing (see below) |

Anti-CoT prompt add-on (any bit-width):

"Answer directly. Do not include your reasoning steps."

(Optional) Add a stop string if your stack supports it (e.g., stop at "\nHere are my reasoning steps:").
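
If your stack has no stop-string support, a crude shell-side workaround (a sketch, not an mlx_vlm feature) is to cut the output at the leaked reasoning header:

python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
  --prompt "Describe this image clearly." \
  --max-tokens 192 --temperature 0.0 --device mps --seed 0 \
  | sed '/Here are my reasoning steps:/,$d'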


πŸ› οΈ One-liners (swap model IDs)

# 2-bit
python -m mlx_vlm.generate --model <2bit-repo> --image img.jpg --prompt "Describe this image." \
  --max-tokens 96 --temperature 0.0 --device mps --seed 0

# 3-bit (recommended default)
python -m mlx_vlm.generate --model <3bit-repo> --image img.jpg --prompt "Describe this image in two sentences." \
  --max-tokens 128 --temperature 0.0 --device mps --seed 0

# 4-bit
python -m mlx_vlm.generate --model <4bit-repo> --image img.jpg --prompt "Summarize the document and read key totals." \
  --max-tokens 160 --temperature 0.1 --device mps --seed 0

# 5-bit
python -m mlx_vlm.generate --model <5bit-repo> --image img.jpg --prompt "Extract the fields (date, total, vendor) from this invoice." \
  --max-tokens 192 --temperature 0.1 --device mps --seed 0

# 6-bit
python -m mlx_vlm.generate --model <6bit-repo> --image img.jpg \
  --prompt "Answer directly. Do not include your reasoning steps.\n\nDescribe this image clearly." \
  --max-tokens 192 --temperature 0.0 --device mps --seed 0
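
To compare several variants on the same image in one pass, a simple shell loop works (replace the placeholder repo IDs first, as above):

for repo in <2bit-repo> <3bit-repo> <4bit-repo> <5bit-repo> <6bit-repo>; do
  echo "== $repo =="
  python -m mlx_vlm.generate --model "$repo" --image img.jpg \
    --prompt "Describe this image in two sentences." \
    --max-tokens 96 --temperature 0.0 --device mps --seed 0
done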