gpt-oss-120b — MLX 6-bit (group size 64)

Summary. This is a 6-bit MLX quantization of gpt-oss-120B with group size 64. It targets a smaller memory footprint and higher throughput than the 8-bit gs=32 build while keeping quality close to the bf16/8-bit references.

  • Base model: openai/gpt-oss-120b (Apache-2.0)
  • Quantization: MLX int6, q_group_size=64 (some tensors may remain 16-bit for stability)
  • Files: MLX weight shards + config.json; tokenizer files included for drop-in use
  • Intended use: local inference / research on M-series Macs
  • Not intended for: safety-critical decisions; outputs may be inaccurate or biased

Requirements

Runs on Apple Silicon (M1 or newer) with macOS ≥ 13.5 via MLX (Metal).

  • Not supported: Intel macOS / Linux / Windows (consider a GGUF build + llama.cpp instead).
  • Memory guidance: notably smaller footprint than the 8-bit/gs32 build; 64–96 GB of unified memory is recommended for comfortable headroom on 120B at moderate context sizes. The effective GPU working set is capped by Metal’s recommended budget; keep 5–10% headroom (see the sketch below for checking the budget at runtime).
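A minimal way to check the Metal budget at runtime, as a sketch (assumes a recent MLX release where mx.metal.device_info() is exposed; key names may vary by version):

import mlx.core as mx

# Sketch: inspect the Metal working-set budget before loading the model.
# Assumption: mx.metal.device_info() is available (recent MLX releases).
info = mx.metal.device_info()
total_gb = info["memory_size"] / 1024**3
budget_gb = info["max_recommended_working_set_size"] / 1024**3
print(f"unified memory: {total_gb:.0f} GB, recommended Metal working set: {budget_gb:.0f} GB")
print(f"target model + KV cache footprint (~10% headroom): ~{0.9 * budget_gb:.0f} GB")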

How to use (MLX)

pip install mlx-lm

# Python API (uses the tokenizer bundled with this repo)
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512
))

# CLI
python -m mlx_lm generate --model halley-ai/gpt-oss-120b-MLX-6bit-gs64 \
  --prompt "Explain the Chudnovsky algorithm to compute pi." \
  --max-kv-size 512 --max-tokens 256
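For chat-style prompts, the bundled tokenizer's chat template can be applied before generation. A short sketch (assumes the tokenizer ships a chat template, as the gpt-oss releases do, and a recent mlx-lm where apply_chat_template returns token IDs that generate accepts):

from mlx_lm import load, generate

# Sketch: chat-style prompting via the bundled chat template (assumed present).
model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")

messages = [{"role": "user", "content": "Explain the Chudnovsky algorithm to compute pi."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=256, max_kv_size=512))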

Evaluation

Perplexity (PPL) is measured with a streaming evaluation on WikiText-2 (raw, test); the fast preset (window = stride = 4096, ~100k tokens, EOS inserted between documents) is recommended:

python python/scripts/test_perplexity-mlx.py \
  --model_path "/path/to/gpt-oss-120b-MLX-6bit-gs64" \
  --fast --progress

For more sensitive comparisons, use overlapping windows (for example, --stride 512) and evaluate the full split.
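If the repo script is unavailable, the core loop is small. Below is a sketch of the fast preset (window = stride = 4096, no overlap), assuming a plain-text dump of the split under a placeholder file name; note that a full 4096-token forward pass through 120B is memory-hungry, and the repo script additionally clamps the compute window, which this sketch omits:

import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

# Sketch: non-overlapping streaming PPL (window == stride == 4096).
# "wikitext2_test.txt" is a placeholder for a pre-concatenated dump of the split.
model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")
ids = tokenizer.encode(open("wikitext2_test.txt").read())

window = 4096
nll, count = 0.0, 0
for start in range(0, len(ids) - 1, window):
    chunk = ids[start:start + window + 1]
    if len(chunk) < 2:
        break
    inputs = mx.array(chunk[:-1])[None]    # (1, T) inputs
    targets = mx.array(chunk[1:])[None]    # (1, T) next-token targets
    logits = model(inputs)                 # (1, T, vocab)
    losses = nn.losses.cross_entropy(logits, targets, reduction="none")
    nll += losses.sum().item()
    count += losses.size

# For overlapping windows (e.g. --stride 512), advance by the stride instead
# and score each token only once (the trailing non-overlapping positions).
print(f"PPL ~ {math.exp(nll / count):.2f}")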

Results

Variant                  PPL (ctx=4096, fast)
MLX 6-bit (gs=64)        7.40
MLX 8-bit (gs=32)        7.39
MLX bf16 (reference)     7.38

Conversion details (provenance)

python -m mlx_lm convert \
  --hf-path openai/gpt-oss-120b \
  --mlx-path gpt-oss-120b-MLX-6bit-gs64 \
  --q-bits 6 --q-group-size 64 -q
  • Some tensors (for example, embeddings/norms/router) may remain 16-bit for numerical stability.
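One way to confirm which modules stayed in 16-bit is to load the converted model and look for layers that were not replaced by their quantized counterparts. A sketch (assumes mlx.nn exposes named_modules(), QuantizedLinear, and QuantizedEmbedding, as recent MLX releases do; norms are never quantized and are not listed):

import mlx.nn as nn
from mlx_lm import load

# Sketch: list linear/embedding modules that were left unquantized (16-bit).
model, _ = load("gpt-oss-120b-MLX-6bit-gs64")  # local path from the convert command above

kept_fp16 = [
    name
    for name, module in model.named_modules()
    if isinstance(module, (nn.Linear, nn.Embedding))
    and not isinstance(module, (nn.QuantizedLinear, nn.QuantizedEmbedding))
]
print(f"{len(kept_fp16)} modules left in 16-bit, e.g. {kept_fp16[:5]}")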

Footprint and speed tips

  • Limit KV cache: set --max-kv-size (CLI) or max_kv_size (Python) to the smallest context you need.
  • Batching: prefer single-stream generation; large batches increase memory pressure on 120B.
  • Compute windowing: when evaluating PPL, the provided script auto-clamps the compute window to avoid Metal’s per-buffer limits.
  • Sampler settings: top-p/top-k sampling with a moderate temperature keeps decoding cheap; avoid search-based decoding strategies (see the sketch after this list).
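A sampling setup along those lines, as a sketch (assumes a recent mlx-lm that exposes make_sampler and a sampler= argument on generate; older releases take temp/top_p keyword arguments directly):

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Sketch: explicit top-p/top-k sampling with a moderate temperature.
model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")
sampler = make_sampler(temp=0.7, top_p=0.9, top_k=40)

print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute pi.",
    max_tokens=256, max_kv_size=512,
    sampler=sampler,
))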

Sibling and reference models

  • halley-ai/gpt-oss-120b-MLX-8bit-gs32 (reference 8-bit)
  • halley-ai/gpt-oss-120b-MLX-bf16 (non-quantized reference)

Limitations and biases

Outputs may be factually wrong or unsafe. Do not use for medical, legal, or financial decisions without human review. Large models can be sensitive to prompt wording; prefer explicit instructions and structure.

License and credits

  • License: Apache-2.0 (inherits from base model)
  • Base model: OpenAI gpt-oss-120B
  • Quantization: Halley AI Lab (MLX int6, gs=64)
  • Please cite both the base model and this repository when you use the weights.