OpenAI GPT OSS 20B - MLX 4-bit

Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. The model was converted from the original gpt_oss release to MLX format using a development version of mlx-lm (v0.26.3).

Tip: For best results with tool calling and reasoning, update LM Studio to version 0.3.22 or later.

Quick Start

Installation

pip install mlx-lm

Usage

from mlx_lm import load, generate

# Load the quantized model
model, tokenizer = load("InferenceIllusionist/gpt-oss-20b-MLX-4bit")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
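
Because the repository ships a chat_template.jinja, chat-style prompts can be built with the tokenizer's chat template. Continuing from the snippet above (the message content is just an example):

# Chat-style prompting via the bundled chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)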

Technical Specifications

Model Architecture

  • Model Type: gpt_oss (GPT Open Source)
  • Architecture: GptOssForCausalLM
  • Parameters: ~20 billion parameters
  • Quantization: 4-bit precision (4.504 bits per weight)

Core Parameters

| Parameter | Value |
|---|---|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |

Mixture of Experts (MoE) Configuration

  • Number of Local Experts: 32
  • Experts per Token: 4
  • Router Auxiliary Loss Coefficient: 0.9
  • SwiGLU Limit: 7.0
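
With 32 experts and 4 experts per token, each token's hidden state is scored by a router and only the four highest-scoring experts are evaluated. A simplified NumPy illustration of this top-k routing (names, shapes, and the softmax-after-selection detail are illustrative, not the actual MLX kernels):

import numpy as np

def route_tokens(hidden, router_weight, experts_per_token=4):
    # Score every expert for every token, keep the top 4, and
    # renormalize their scores into mixing weights.
    logits = hidden @ router_weight.T                              # (tokens, 32)
    topk = np.argsort(logits, axis=-1)[:, -experts_per_token:]     # 4 best expert indices
    topk_logits = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return topk, weights

tokens = np.random.randn(5, 2880)      # 5 token embeddings, hidden size 2,880
router = np.random.randn(32, 2880)     # one scoring row per expert
experts, mix = route_tokens(tokens, router)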

Attention Mechanism

  • Attention Type: Hybrid sliding window and full attention
  • Sliding Window Size: 128 tokens
  • Max Position Embeddings: 131,072 tokens
  • Initial Context Length: 4,096 tokens
  • Attention Pattern: Alternating sliding and full attention layers
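
The two layer types differ only in how far back each token may attend. A rough mask-based sketch (illustrative; the real implementation uses fused MLX attention kernels):

import numpy as np

def attention_mask(seq_len, sliding_window=None):
    # Causal mask; with sliding_window set, each token sees only the
    # most recent `sliding_window` positions instead of the full prefix.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i
    if sliding_window is not None:
        mask &= j > i - sliding_window
    return mask

full = attention_mask(8)                       # full-attention layers
sliding = attention_mask(8, sliding_window=4)  # sliding-window layers (128 in this model)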

RoPE (Rotary Position Embedding) Configuration

  • RoPE Theta: 150,000
  • RoPE Scaling Type: YaRN (Yet another RoPE extensioN)
  • Scaling Factor: 32.0
  • Beta Fast: 32.0
  • Beta Slow: 1.0
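
These values plug into the standard YaRN frequency-correction recipe: high-frequency rotary dimensions keep their original frequencies, low-frequency dimensions are interpolated by the scaling factor, and beta_fast/beta_slow set the ramp between the two regimes. A simplified sketch following the published YaRN formulation (not the exact MLX code, and omitting the attention temperature term):

import math
import numpy as np

def yarn_inv_freq(dim=64, base=150000.0, factor=32.0,
                  beta_fast=32.0, beta_slow=1.0, orig_ctx=4096):
    pos_freqs = base ** (np.arange(0, dim, 2) / dim)
    extrapolation = 1.0 / pos_freqs                 # original frequencies
    interpolation = 1.0 / (factor * pos_freqs)      # stretched for long context

    # Rotary dimension whose wavelength completes `rot` turns over the original context
    def correction_dim(rot):
        return dim * math.log(orig_ctx / (rot * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)
    ramp = np.clip((np.arange(dim // 2) - low) / max(high - low, 1e-3), 0.0, 1.0)

    # ramp = 0 -> keep original frequency; ramp = 1 -> fully interpolated
    return interpolation * ramp + extrapolation * (1.0 - ramp)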

Quantization Details

  • Quantization Method: MLX 4-bit quantization
  • Group Size: 64
  • Effective Bits per Weight: 4.504
  • Size Reduction: 13GB → 11GB (~15% reduction)
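
The effective bit width follows from the group size: every group of 64 4-bit weights also stores a small floating-point scale and bias. A back-of-the-envelope check (assuming 16-bit scale and bias per group; the reported 4.504 also reflects a few tensors kept at higher precision):

q_bits = 4
group_size = 64
scale_bits = 16     # per-group scale (assumed fp16)
bias_bits = 16      # per-group bias (assumed fp16)

effective_bits = q_bits + (scale_bits + bias_bits) / group_size
print(effective_bits)   # 4.5, close to the reported 4.504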

File Structure

gpt-oss-20b-MLX-4bit/
├── config.json                           # Model configuration
├── model-00001-of-00003.safetensors     # Model weights (part 1)
├── model-00002-of-00003.safetensors     # Model weights (part 2)
├── model-00003-of-00003.safetensors     # Model weights (part 3)
├── model.safetensors.index.json         # Model sharding index
├── tokenizer.json                       # Tokenizer configuration
├── tokenizer_config.json               # Tokenizer settings
├── special_tokens_map.json             # Special tokens mapping
├── generation_config.json              # Generation parameters
└── chat_template.jinja                 # Chat template

Performance Characteristics

Hardware Requirements

  • Platform: Apple Silicon (M1, M2, M3, M4 series)
  • Memory: ~11GB for model weights
  • Recommended RAM: 16GB+ for optimal performance

Layer Configuration

The model uses an alternating attention pattern across its 24 layers:

  • Even layers (0, 2, 4, ...): Sliding window attention (128 tokens)
  • Odd layers (1, 3, 5, ...): Full attention
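
The per-layer pattern can be inspected directly from the shipped configuration (a small sketch; assumes the config exposes a layer_types list as in the upstream gpt_oss configuration):

import json

# Run from inside the model directory (or pass the full path to config.json)
with open("config.json") as f:
    cfg = json.load(f)

for idx, kind in enumerate(cfg.get("layer_types", [])):
    print(idx, kind)    # expected: alternating sliding_attention / full_attention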

Training Details

Tokenizer

  • Type: Custom tokenizer with a 201,088-token vocabulary
  • Special Tokens:
    • EOS Token ID: 200,002
    • Pad Token ID: 199,999
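
These IDs can be verified against the tokenizer files shipped in this repository (a minimal check using the Hugging Face tokenizer; the count reported by len() may differ slightly depending on added tokens):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("InferenceIllusionist/gpt-oss-20b-MLX-4bit")
print(len(tok))            # vocabulary size
print(tok.eos_token_id)    # expected 200002
print(tok.pad_token_id)    # expected 199999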

Model Configuration

  • Hidden Activation: SiLU (Swish)
  • Normalization: RMSNorm (ε = 1e-05)
  • Initializer Range: 0.02
  • Attention Dropout: 0.0
  • Attention Bias: Enabled
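
For reference, RMSNorm with ε = 1e-05 amounts to the following (a minimal MLX sketch, not the model's actual module):

import mlx.core as mx

def rms_norm(x, weight, eps=1e-5):
    # Scale activations by their reciprocal RMS, then apply the learned weight
    return weight * x * mx.rsqrt(mx.mean(x * x, axis=-1, keepdims=True) + eps)

x = mx.random.normal((4, 2880))   # hidden size 2,880
w = mx.ones((2880,))
y = rms_norm(x, w)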

Conversion Process

This model was successfully converted using:

  • MLX-LM Version: 0.26.3 (development branch)
  • Conversion Command:
    python3 -m mlx_lm convert \
      --hf-path "/path/to/openai-gpt-oss-20b" \
      --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
      --quantize \
      --q-bits 4
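
The same conversion can also be driven from Python (a sketch assuming the mlx_lm convert API with its default group size of 64; paths are placeholders):

from mlx_lm import convert

convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)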
    

Known Limitations

  1. Architecture Specificity: This model uses the gpt_oss architecture, which is supported only in MLX-LM v0.26.3 and later
  2. Platform Dependency: Optimized specifically for Apple Silicon; may not run on other platforms
  3. Quantization Trade-offs: 4-bit quantization may result in slight quality degradation compared to full precision

Compatibility

  • MLX-LM: Requires v0.26.3 or later for gpt_oss support
  • Apple Silicon: M1, M2, M3, M4 series processors
  • macOS: Compatible with recent macOS versions supporting MLX

License

Please refer to the original OpenAI GPT OSS 20B model license terms.

Acknowledgments

  • Original model by OpenAI
  • MLX framework by Apple Machine Learning Research
  • Quantization achieved using mlx-lm development tools

Model Size: 11GB
Quantization: 4-bit (4.504 bits/weight)
Created: August 6, 2025
MLX-LM Version: 0.26.3 (development)
