sebastavar committed on
Commit 9c7d58d · verified · 1 Parent(s): 470b0af

Update README.md

Files changed (1)
  1. README.md +117 -3
README.md CHANGED
@@ -1,3 +1,117 @@
- ---
- license: apache-2.0
- ---

---
library_name: mlx
pipeline_tag: text-generation
inference: false
license: apache-2.0
base_model: openai/gpt-oss-120b
base_model_relation: quantized
language:
- en
- ro
tags:
- apple-silicon
- metal
- arm64
- 6-bit
- group-size-64
- mlx
- mlx-lm
- openai
- halley-ai
---

# gpt-oss-120b — MLX 6-bit (group size 64)

**Summary.** This is a 6-bit MLX quantization of gpt-oss-120b with group size 64. It targets a smaller memory footprint and higher throughput than the 8-bit gs=32 build while keeping quality close to the bf16 and 8-bit references.

- **Base model:** `openai/gpt-oss-120b` (Apache-2.0)
- **Quantization:** MLX int6, `q_group_size=64` (some tensors may remain 16-bit for stability); the quick check after this list shows how to confirm the recorded settings
- **Files:** MLX weight shards + `config.json`; tokenizer files included for drop-in use
- **Intended use:** local inference and research on M-series Macs
- **Not intended for:** safety-critical decisions; outputs may be inaccurate or biased
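
If you want to verify that a local copy matches this card, the conversion settings are recorded in `config.json`. A minimal check (the directory name is a placeholder for wherever you downloaded the repo):

```python
# Print the quantization settings recorded by mlx_lm at conversion time.
import json

with open("gpt-oss-120b-MLX-6bit-gs64/config.json") as f:  # placeholder local path
    cfg = json.load(f)

print(cfg.get("quantization"))  # expected to report bits=6 and group_size=64 for this build
```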

## Requirements

Runs on Apple Silicon (M1 or newer) with macOS ≥ 13.5 via MLX (Metal).

- Not supported: Intel macOS, Linux, Windows (consider a GGUF build with llama.cpp instead).
- Memory guidance: notably smaller footprint than the 8-bit/gs32 build; 64–96 GB of unified memory is recommended for comfortable headroom on a 120B model at moderate context sizes. The effective GPU working set is capped by Metal’s budget, so keep 5–10% free (see the sketch below for a quick way to gauge this).
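
As a rough pre-flight check, you can read the machine’s total unified memory and apply that headroom rule before loading the model. This is a minimal sketch; the 0.90 factor is just the rule of thumb above, not a published Metal limit:

```python
# Estimate a comfortable Metal working-set budget from total unified memory (macOS only).
import subprocess

total_bytes = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"], text=True).strip())
total_gb = total_bytes / 1024**3

budget_gb = 0.90 * total_gb  # keep ~10% headroom, per the guidance above
print(f"Unified memory: {total_gb:.0f} GB; plan for a working set under ~{budget_gb:.0f} GB.")
```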

## How to use (MLX)

```bash
pip install mlx-lm
```

```python
# Python API (uses the tokenizer bundled with this repo)
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")
print(generate(
    model, tokenizer,
    prompt="Explain the Chudnovsky algorithm to compute π.",
    max_tokens=256, max_kv_size=512,
))
```
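
gpt-oss-120b is an instruction-tuned chat model, so conversational prompts usually benefit from the bundled chat template. A minimal sketch, assuming a recent `mlx-lm` whose `generate` accepts a pre-tokenized prompt:

```python
# Chat-style prompting via the tokenizer's chat template (sketch; recent mlx-lm assumed).
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")

messages = [{"role": "user", "content": "Explain the Chudnovsky algorithm to compute π."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```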

```bash
# CLI
python -m mlx_lm generate --model halley-ai/gpt-oss-120b-MLX-6bit-gs64 \
  --prompt "Explain the Chudnovsky algorithm to compute pi." \
  --max-kv-size 512 --max-tokens 256
```

## Evaluation

Perplexity (PPL) is evaluated with a streaming pass over WikiText-2 (raw, test); the fast preset (`window=stride=4096`, ~100k tokens, EOS inserted between documents) is recommended:

```bash
python python/scripts/test_perplexity-mlx.py \
  --model_path "/path/to/gpt-oss-120b-MLX-6bit-gs64" \
  --fast --progress
```

For more sensitive comparisons, use overlapping windows (for example, `--stride 512`) and evaluate the full split.
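
For orientation, the core of a windowed PPL measurement looks roughly like the sketch below (non-overlapping windows only; the bundled script additionally streams the data, inserts EOS between documents, and clamps the compute window for Metal). The eval-text path is a placeholder:

```python
# Rough sketch of windowed perplexity: average per-token NLL, then exponentiate.
import math
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")
tokens = tokenizer.encode(open("wikitext2_test.txt").read())  # placeholder eval text

window = 4096
total_nll, total_tokens = 0.0, 0
for start in range(0, len(tokens) - 1, window):
    chunk = tokens[start:start + window + 1]  # +1 so each window predicts `window` targets
    if len(chunk) < 2:
        break
    inputs = mx.array(chunk[:-1])[None]   # (1, L)
    targets = mx.array(chunk[1:])[None]   # (1, L)
    logits = model(inputs)                # (1, L, vocab)
    total_nll += nn.losses.cross_entropy(logits, targets, reduction="sum").item()
    total_tokens += targets.size

print(f"PPL: {math.exp(total_nll / total_tokens):.2f}")
```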

### Results

| Variant              | PPL (ctx=4096, fast; lower is better) |
|----------------------|---------------------------------------|
| MLX 6-bit (gs=64)    | 7.40                                  |
| MLX 8-bit (gs=32)    | 7.39                                  |
| MLX bf16 (reference) | 7.38                                  |

On this preset, the 6-bit gs=64 build sits within 0.02 PPL (about 0.3%) of the bf16 reference.

## Conversion details (provenance)

```bash
python -m mlx_lm convert \
  --hf-path openai/gpt-oss-120b \
  --mlx-path gpt-oss-120b-MLX-6bit-gs64 \
  --q-bits 6 --q-group-size 64 -q
```

- Some tensors (for example, embeddings, norms, and the router) may remain 16-bit for numerical stability; the sketch below shows one way to list them.
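
To see which tensors were left unquantized in a local copy, one rough check is to scan the shards for parameters still stored in a floating-point dtype. This is a sketch: it assumes the safetensors shards sit in the repo directory and that quantized layers store packed weights alongside separate `scales`/`biases` arrays:

```python
# List tensors still stored in a float dtype (i.e., not packed by the quantizer).
import glob
import mlx.core as mx

float_dtypes = (mx.float16, mx.bfloat16, mx.float32)
unquantized = []
for shard in sorted(glob.glob("gpt-oss-120b-MLX-6bit-gs64/*.safetensors")):  # placeholder path
    for name, weight in mx.load(shard).items():
        if weight.dtype in float_dtypes and not name.endswith((".scales", ".biases")):
            unquantized.append(name)

print(f"{len(unquantized)} tensors kept in a float dtype")
print("\n".join(unquantized[:10]))
```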
95
+
96
+ ## Footprint and speed tips
97
+
98
+ - **Limit KV cache:** set `--max-kv-size` (CLI) or `max_kv_size` (Python) to the smallest context you need.
99
+ - **Batching:** prefer single-stream generation; large batches increase memory pressure on 120B.
100
+ - **Compute windowing:** when evaluating PPL, the provided script auto-clamps the compute window to avoid Metal’s per-buffer limits.
101
+ - **Sampler settings:** top‑p/top‑k sampling with moderate temperature can improve throughput versus beam search.
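
Putting the first and last tips together, a typical low-footprint generation call might look like the following. This is a sketch that assumes a recent `mlx-lm` exposing `make_sampler` and the `sampler`/`max_kv_size` keyword arguments on `generate`:

```python
# Small KV cache plus top-p sampling for a lighter memory footprint (sketch).
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler  # assumed available in recent mlx-lm

model, tokenizer = load("halley-ai/gpt-oss-120b-MLX-6bit-gs64")
sampler = make_sampler(temp=0.7, top_p=0.9)

print(generate(
    model, tokenizer,
    prompt="Summarize the Chudnovsky algorithm in two sentences.",
    max_tokens=128,
    max_kv_size=512,  # keep the KV cache as small as the task allows
    sampler=sampler,
))
```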

## Sibling and reference models

- halley-ai/gpt-oss-120b-MLX-8bit-gs32 (8-bit reference)
- halley-ai/gpt-oss-120b-MLX-bf16 (non-quantized reference)

## Limitations and biases

Outputs may be factually wrong or unsafe. Do not use for medical, legal, or financial decisions without human review. Large models can be sensitive to prompt wording; prefer explicit instructions and structure.

## License and credits

- License: Apache-2.0 (inherited from the base model)
- Base model: OpenAI gpt-oss-120b
- Quantization: Halley AI Lab (MLX int6, gs=64)
- Please cite both the base model and this repository when you use these weights.