# OpenAI GPT OSS 20B - MLX 4-bit

## Model Description

This is a 4-bit quantized version of the OpenAI GPT OSS 20B model, optimized for Apple Silicon using the MLX framework. The model was converted from the original `gpt_oss` architecture to MLX format using the development version of mlx-lm (v0.26.3).

> **Tip:** For best results with tool calling and reasoning, update LM Studio to the latest version (0.3.22).
## Quick Start

### Installation

```bash
pip install mlx-lm
```
### Usage

```python
from mlx_lm import load, generate

# Load the quantized model from the Hugging Face repo
model, tokenizer = load("InferenceIllusionist/gpt-oss-20b-MLX-4bit")

# Generate text
prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, verbose=True, max_tokens=512)
print(response)
```
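Since the repo ships a `chat_template.jinja`, chat-style prompts can be built with the tokenizer's `apply_chat_template`. A minimal sketch, assuming the same load call as above:

```python
from mlx_lm import load, generate

model, tokenizer = load("InferenceIllusionist/gpt-oss-20b-MLX-4bit")

# apply_chat_template comes from the underlying Hugging Face tokenizer and
# renders the bundled chat_template.jinja; it returns token ids, which
# generate() accepts directly.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```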
## Technical Specifications

### Model Architecture

- Model Type: `gpt_oss` (GPT Open Source)
- Architecture: `GptOssForCausalLM`
- Parameters: ~20 billion
- Quantization: 4-bit precision (4.504 bits per weight)
### Core Parameters

| Parameter | Value |
|---|---|
| Hidden Size | 2,880 |
| Intermediate Size | 2,880 |
| Number of Layers | 24 |
| Attention Heads | 64 |
| Key-Value Heads | 8 |
| Head Dimension | 64 |
| Vocabulary Size | 201,088 |
### Mixture of Experts (MoE) Configuration
- Number of Local Experts: 32
- Experts per Token: 4
- Router Auxiliary Loss Coefficient: 0.9
- SwiGLU Limit: 7.0
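To make the routing numbers concrete, here is a schematic top-4-of-32 router in plain NumPy. It is illustrative only, not the model's actual kernel, and the helper name is hypothetical:

```python
import numpy as np

def route_tokens(router_logits, experts_per_token=4):
    """Pick the top-k experts per token and renormalize their scores.
    Schematic sketch, not the model's actual implementation."""
    top = np.argsort(router_logits, axis=-1)[..., -experts_per_token:]   # top-4 expert indices
    scores = np.take_along_axis(router_logits, top, axis=-1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                       # softmax over the chosen 4
    return top, weights

logits = np.random.randn(3, 32)            # 3 tokens, 32 local experts
experts, weights = route_tokens(logits)    # experts: (3, 4); weights sum to 1 per token
```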
### Attention Mechanism
- Attention Type: Hybrid sliding window and full attention
- Sliding Window Size: 128 tokens
- Max Position Embeddings: 131,072 tokens
- Initial Context Length: 4,096 tokens
- Attention Pattern: Alternating sliding and full attention layers
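A minimal sketch of the two mask types, assuming a boolean "may attend" convention (the helper name is hypothetical):

```python
import numpy as np

def causal_mask(n, window=None):
    """Boolean attention mask (True = may attend). window=128 reproduces the
    sliding-window layers; window=None is full causal attention."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = j <= i                    # causal: never attend to future tokens
    if window is not None:
        mask &= (i - j) < window     # keep only the last `window` tokens
    return mask

sliding = causal_mask(256, window=128)   # sliding-window layers
full = causal_mask(256)                  # full-attention layers
```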
### RoPE (Rotary Position Embedding) Configuration
- RoPE Theta: 150,000
- RoPE Scaling Type: YaRN (Yet another RoPE extensioN)
- Scaling Factor: 32.0
- Beta Fast: 32.0
- Beta Slow: 1.0
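For orientation, the base rotary frequencies implied by these numbers can be computed as below. The YaRN interpolation itself (scaling factor 32.0, beta_fast/beta_slow bounds) is omitted, so this is a sketch of the unscaled case only:

```python
import numpy as np

head_dim, theta = 64, 150_000.0

# Base (unscaled) rotary frequencies; YaRN then interpolates these to stretch
# the 4,096-token initial context toward the 131,072-token maximum.
inv_freq = theta ** (-np.arange(0, head_dim, 2) / head_dim)

positions = np.arange(8)
angles = np.outer(positions, inv_freq)   # rotation angle per (position, frequency pair)
cos, sin = np.cos(angles), np.sin(angles)
```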
### Quantization Details
- Quantization Method: MLX 4-bit quantization
- Group Size: 64
- Effective Bits per Weight: 4.504
- Size Reduction: 13GB → 11GB (~15% reduction)
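The 4.504 figure is consistent with grouped affine quantization: 4 bits per weight plus a 16-bit scale and 16-bit bias shared by each group of 64 gives 4 + 32/64 = 4.5 bits, with the small remainder coming from tensors kept at higher precision. A toy sketch of one group, not MLX's actual kernel:

```python
import numpy as np

def quantize_group(w, bits=4):
    """Affine-quantize one 64-weight group: 4-bit codes plus a shared scale
    and bias, in the spirit of MLX's grouped scheme (toy version; degenerate
    all-equal groups are not handled)."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((w - lo) / scale).astype(np.uint8)   # values in 0..15
    return codes, np.float16(scale), np.float16(lo)

def dequantize_group(codes, scale, bias):
    return codes.astype(np.float32) * np.float32(scale) + np.float32(bias)

group = np.random.randn(64).astype(np.float32)
codes, scale, bias = quantize_group(group)
approx = dequantize_group(codes, scale, bias)             # max error ~ scale / 2
```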
## File Structure

```
gpt-oss-20b-MLX-4bit/
├── config.json                      # Model configuration
├── model-00001-of-00003.safetensors # Model weights (part 1)
├── model-00002-of-00003.safetensors # Model weights (part 2)
├── model-00003-of-00003.safetensors # Model weights (part 3)
├── model.safetensors.index.json     # Model sharding index
├── tokenizer.json                   # Tokenizer configuration
├── tokenizer_config.json            # Tokenizer settings
├── special_tokens_map.json          # Special tokens mapping
├── generation_config.json           # Generation parameters
└── chat_template.jinja              # Chat template
```
## Performance Characteristics

### Hardware Requirements
- Platform: Apple Silicon (M1, M2, M3, M4 series)
- Memory: ~11GB for model weights
- Recommended RAM: 16GB+ for optimal performance
### Layer Configuration
The model uses an alternating attention pattern across its 24 layers:
- Even layers (0, 2, 4, ...): Sliding window attention (128 tokens)
- Odd layers (1, 3, 5, ...): Full attention
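Expressed as a per-layer list (the `layer_types` key and its string values follow the Hugging Face config convention and are included here as an assumption):

```python
# Hypothetical reconstruction of the schedule described above
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(24)
]
```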
## Training Details

### Tokenizer

- Type: Custom tokenizer with a 201,088-token vocabulary
- Special Tokens:
  - EOS Token ID: 200002
  - Pad Token ID: 199999
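A quick way to confirm these IDs once the model is downloaded (a sketch; the repo id is taken from this card):

```python
from mlx_lm import load

# Loads the model just to inspect the tokenizer
_, tokenizer = load("InferenceIllusionist/gpt-oss-20b-MLX-4bit")
print(tokenizer.eos_token_id)  # expected: 200002
print(tokenizer.pad_token_id)  # expected: 199999
```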
### Model Configuration
- Hidden Activation: SiLU (Swish)
- Normalization: RMSNorm (ε = 1e-05)
- Initializer Range: 0.02
- Attention Dropout: 0.0
- Attention Bias: Enabled
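A minimal RMSNorm reference matching the ε above (illustrative, not the model's MLX implementation):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    """Minimal RMSNorm: divide by the root-mean-square of the hidden vector,
    then apply a learned per-channel gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

hidden = np.random.randn(1, 2880).astype(np.float32)   # hidden size 2,880
out = rms_norm(hidden, np.ones(2880, dtype=np.float32))
```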
## Conversion Process

This model was converted using:

- MLX-LM Version: 0.26.3 (development branch)
- Conversion Command:

```bash
python3 -m mlx_lm convert \
  --hf-path "/path/to/openai-gpt-oss-20b" \
  --mlx-path "/path/to/gpt-oss-20b-MLX-4bit" \
  --quantize \
  --q-bits 4
```
## Known Limitations

- Architecture Specificity: This model uses the `gpt_oss` architecture, which is only supported in MLX-LM v0.26.3+
- Platform Dependency: Optimized specifically for Apple Silicon; may not run on other platforms
- Quantization Trade-offs: 4-bit quantization may result in slight quality degradation compared to full precision
## Compatibility

- MLX-LM: Requires v0.26.3 or later for `gpt_oss` support
- Apple Silicon: M1, M2, M3, M4 series processors
- macOS: Compatible with recent macOS versions supporting MLX
## License
Please refer to the original OpenAI GPT OSS 20B model license terms.
## Acknowledgments

- Original model by OpenAI
- MLX framework by Apple Machine Learning Research
- Quantization achieved using `mlx-lm` development tools
---

- Model Size: 11GB
- Quantization: 4-bit (4.504 bits/weight)
- Created: August 6, 2025
- MLX-LM Version: 0.26.3 (development)
- Base Model: [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)