OpenAI GPT OSS 20B - MLX 4-bit

Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. The model was converted from the original gpt_oss release to MLX format using a development version of mlx-lm (v0.26.3).

Tip: For best results with tool calling and reasoning, update LM Studio to version 0.3.22 or later.

Quick Start

Installation

pip install mlx-lm

Usage

from mlx_lm import load, generate

# Load the quantized model
model, tokenizer = load("InferenceIllusionist/gpt-oss-20b-MLX-4bit")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
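
Because the repository ships a chat_template.jinja, chat-style prompts can be built with the tokenizer's chat template. Continuing from the snippet above (the message content is just an example):

# Chat-style prompting via the bundled chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)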

Technical Specifications

Model Architecture

  • Model Type: gpt_oss (GPT Open Source)
  • Architecture: GptOssForCausalLM
  • Parameters: ~20 billion parameters
  • Quantization: 4-bit precision (4.504 bits per weight)

Core Parameters

| Parameter | Value |
|---|---|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |

Mixture of Experts (MoE) Configuration

  • Number of Local Experts: 32
  • Experts per Token: 4
  • Router Auxiliary Loss Coefficient: 0.9
  • SwiGLU Limit: 7.0
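
With 32 experts and 4 experts per token, each token's hidden state is scored by a router and only the four highest-scoring experts are evaluated. A simplified NumPy illustration of this top-k routing (names, shapes, and the softmax-after-selection detail are illustrative, not the actual MLX kernels):

import numpy as np

def route_tokens(hidden, router_weight, experts_per_token=4):
    # Score every expert for every token, keep the top 4, and
    # renormalize their scores into mixing weights.
    logits = hidden @ router_weight.T                              # (tokens, 32)
    topk = np.argsort(logits, axis=-1)[:, -experts_per_token:]     # 4 best expert indices
    topk_logits = np.take_along_axis(logits, topk, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return topk, weights

tokens = np.random.randn(5, 2880)      # 5 token embeddings, hidden size 2,880
router = np.random.randn(32, 2880)     # one scoring row per expert
experts, mix = route_tokens(tokens, router)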

Attention Mechanism

  • Attention Type: Hybrid sliding window and full attention
  • Sliding Window Size: 128 tokens
  • Max Position Embeddings: 131,072 tokens
  • Initial Context Length: 4,096 tokens
  • Attention Pattern: Alternating sliding and full attention layers
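
The two layer types differ only in how far back each token may attend. A rough mask-based sketch (illustrative; the real implementation uses fused MLX attention kernels):

import numpy as np

def attention_mask(seq_len, sliding_window=None):
    # Causal mask; with sliding_window set, each token sees only the
    # most recent `sliding_window` positions instead of the full prefix.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i
    if sliding_window is not None:
        mask &= j > i - sliding_window
    return mask

full = attention_mask(8)                       # full-attention layers
sliding = attention_mask(8, sliding_window=4)  # sliding-window layers (128 in this model)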

RoPE (Rotary Position Embedding) Configuration

  • RoPE Theta: 150,000
  • RoPE Scaling Type: YaRN (Yet another RoPE extensioN)
  • Scaling Factor: 32.0
  • Beta Fast: 32.0
  • Beta Slow: 1.0
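
These values plug into the standard YaRN frequency-correction recipe: high-frequency rotary dimensions keep their original frequencies, low-frequency dimensions are interpolated by the scaling factor, and beta_fast/beta_slow set the ramp between the two regimes. A simplified sketch following the published YaRN formulation (not the exact MLX code, and omitting the attention temperature term):

import math
import numpy as np

def yarn_inv_freq(dim=64, base=150000.0, factor=32.0,
                  beta_fast=32.0, beta_slow=1.0, orig_ctx=4096):
    pos_freqs = base ** (np.arange(0, dim, 2) / dim)
    extrapolation = 1.0 / pos_freqs                 # original frequencies
    interpolation = 1.0 / (factor * pos_freqs)      # stretched for long context

    # Rotary dimension whose wavelength completes `rot` turns over the original context
    def correction_dim(rot):
        return dim * math.log(orig_ctx / (rot * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)
    ramp = np.clip((np.arange(dim // 2) - low) / max(high - low, 1e-3), 0.0, 1.0)

    # ramp = 0 -> keep original frequency; ramp = 1 -> fully interpolated
    return interpolation * ramp + extrapolation * (1.0 - ramp)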

Quantization Details

  • Quantization Method: MLX 4-bit quantization
  • Group Size: 64
  • Effective Bits per Weight: 4.504
  • Size Reduction: 13GB → 11GB (~15% reduction)
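
The effective bit width follows from the group size: every group of 64 4-bit weights also stores a small floating-point scale and bias. A back-of-the-envelope check (assuming 16-bit scale and bias per group; the reported 4.504 also reflects a few tensors kept at higher precision):

q_bits = 4
group_size = 64
scale_bits = 16     # per-group scale (assumed fp16)
bias_bits = 16      # per-group bias (assumed fp16)

effective_bits = q_bits + (scale_bits + bias_bits) / group_size
print(effective_bits)   # 4.5, close to the reported 4.504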

File Structure

gpt-oss-20b-MLX-4bit/
├── config.json                           # Model configuration
├── model-00001-of-00003.safetensors     # Model weights (part 1)
├── model-00002-of-00003.safetensors     # Model weights (part 2)
├── model-00003-of-00003.safetensors     # Model weights (part 3)
├── model.safetensors.index.json         # Model sharding index
├── tokenizer.json                       # Tokenizer configuration
├── tokenizer_config.json               # Tokenizer settings
├── special_tokens_map.json             # Special tokens mapping
├── generation_config.json              # Generation parameters
└── chat_template.jinja                 # Chat template

Performance Characteristics

Hardware Requirements

  • Platform: Apple Silicon (M1, M2, M3, M4 series)
  • Memory: ~11GB for model weights
  • Recommended RAM: 16GB+ for optimal performance

Layer Configuration

The model uses an alternating attention pattern across its 24 layers:

  • Even layers (0, 2, 4, ...): Sliding window attention (128 tokens)
  • Odd layers (1, 3, 5, ...): Full attention
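
The per-layer pattern can be inspected directly from the shipped configuration (a small sketch; assumes the config exposes a layer_types list as in the upstream gpt_oss configuration):

import json

# Run from inside the model directory (or pass the full path to config.json)
with open("config.json") as f:
    cfg = json.load(f)

for idx, kind in enumerate(cfg.get("layer_types", [])):
    print(idx, kind)    # expected: alternating sliding_attention / full_attention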

Training Details

Tokenizer

  • Type: Custom tokenizer with a 201,088-token vocabulary
  • Special Tokens:
    • EOS Token ID: 200,002
    • Pad Token ID: 199,999
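
These IDs can be verified against the tokenizer files shipped in this repository (a minimal check using the Hugging Face tokenizer; the count reported by len() may differ slightly depending on added tokens):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("InferenceIllusionist/gpt-oss-20b-MLX-4bit")
print(len(tok))            # vocabulary size
print(tok.eos_token_id)    # expected 200002
print(tok.pad_token_id)    # expected 199999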

Model Configuration

  • Hidden Activation: SiLU (Swish)
  • Normalization: RMSNorm (ε = 1e-05)
  • Initializer Range: 0.02
  • Attention Dropout: 0.0
  • Attention Bias: Enabled
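
For reference, RMSNorm with ε = 1e-05 amounts to the following (a minimal MLX sketch, not the model's actual module):

import mlx.core as mx

def rms_norm(x, weight, eps=1e-5):
    # Scale activations by their reciprocal RMS, then apply the learned weight
    return weight * x * mx.rsqrt(mx.mean(x * x, axis=-1, keepdims=True) + eps)

x = mx.random.normal((4, 2880))   # hidden size 2,880
w = mx.ones((2880,))
y = rms_norm(x, w)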

Conversion Process

This model was successfully converted using:

  • MLX-LM Version: 0.26.3 (development branch)
  • Conversion Command:
    python3 -m mlx_lm convert \
      --hf-path "/path/to/openai-gpt-oss-20b" \
      --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
      --quantize \
      --q-bits 4
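
The same conversion can also be driven from Python (a sketch assuming the mlx_lm convert API with its default group size of 64; paths are placeholders):

from mlx_lm import convert

convert(
    hf_path="/path/to/openai-gpt-oss-20b",
    mlx_path="/path/to/gpt-oss-20b-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)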
    

Known Limitations

  1. Architecture Specificity: This model uses the gpt_oss architecture, which is supported only in MLX-LM v0.26.3 and later
  2. Platform Dependency: Optimized specifically for Apple Silicon; may not run on other platforms
  3. Quantization Trade-offs: 4-bit quantization may result in slight quality degradation compared to full precision

Compatibility

  • MLX-LM: Requires v0.26.3 or later for gpt_oss support
  • Apple Silicon: M1, M2, M3, M4 series processors
  • macOS: Compatible with recent macOS versions supporting MLX

License

Please refer to the original OpenAI GPT OSS 20B model license terms.

Acknowledgments

  • Original model by OpenAI
  • MLX framework by Apple Machine Learning Research
  • Quantization achieved using mlx-lm development tools

Model Size: 11GB
Quantization: 4-bit (4.504 bits/weight)
Created: August 6, 2025
MLX-LM Version: 0.26.3 (development)
