|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations |
|
language: |
|
- en |
|
pipeline_tag: text-generation |
|
tags: |
|
- mixture-of-experts |
|
- moe |
|
- expert-pruning |
|
- gpt-oss |
|
- openai |
|
- reasoning |
|
- math |
|
- specialized |
|
- efficient |
|
- transformer |
|
- causal-lm |
|
- text-generation |
|
- pytorch |
|
- pruned-model |
|
- domain-specific |
|
--- |
|
|
|
# Math GPT-OSS Model (25 Experts) |
|
|
|
**Project**: https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/ |
|
|
|
<div align="center"> |
|
|
|
### 👥 Follow the Authors |
|
|
|
**Aman Priyanshu** |
|
[LinkedIn](https://www.linkedin.com/in/aman-priyanshu/)

[Twitter/X](https://x.com/AmanPriyanshu6)

[Website](https://amanpriyanshu.github.io/)
|
|
|
**Supriti Vijay** |
|
[LinkedIn](https://www.linkedin.com/in/supriti-vijay/)

[Twitter/X](https://x.com/SupritiVijay)

[Website](https://supritivijay.github.io/)
|
|
|
</div> |
|
|
|
## Introduction |
|
|
|
This is a pruned variant of OpenAI's GPT-OSS-20B model, reduced to 25 experts per layer based on activation patterns from the [AmanPriyanshu/GPT-OSS-20B MoE Expert Activations dataset](https://huggingface.co/datasets/AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations). We analyzed router decisions across evaluation benchmarks to identify and retain experts most relevant for math tasks. |
|
|
|
**⚠️ Experimental Model**: This is an experimental pruned model that may not work well - check the [examples below](#model-examples) to see if the outputs meet your needs before use. |
|
|
|
This pruning approach reduces the model size while attempting to preserve performance on the target domain. |
|
|
|
## Model Architecture & Statistics |
|
|
|
| Metric | Value | |
|
|--------|-------| |
|
| **Base Model** | openai/gpt-oss-20b | |
|
| **Architecture** | Mixture-of-Experts Transformer | |
|
| **Total Parameters** | ~16.7B (pruned from 21B) | |
|
| **Original Experts per Layer** | 32 | |
|
| **Pruned Experts per Layer** | 25 | |
|
| **Layers** | 24 | |
|
| **Top-k Routing** | 4 (sketched below) |
|
| **Context Length** | 128K tokens | |
|
| **Attention Heads** | 64 (Query), 8 (Key-Value) | |
|
| **Residual Dimension** | 2880 | |
|
| **Attention Pattern** | Alternating dense & sliding window (128 tokens) | |
|
| **Positional Encoding** | RoPE (Rotary Position Embedding) | |
|
| **Normalization** | RMSNorm | |
|
| **Precision** | BF16 | |
|
| **License** | Apache 2.0 | |
|
| **Specialization** | Math | |
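
The routing behavior summarized in the table can be illustrated with a small schematic. This is a toy sketch, not the model's actual router module: the shapes follow the table above (residual dimension 2880, 25 experts, top-4 routing), but the layer and tensors here are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Toy top-k routing over the 25 remaining experts (top-k = 4, per the table)
hidden = torch.randn(1, 2880)        # one token, residual dimension 2880
router = torch.nn.Linear(2880, 25)   # after pruning, the router scores 25 experts

logits = router(hidden)
weights, expert_ids = torch.topk(logits, k=4, dim=-1)  # select 4 experts per token
weights = F.softmax(weights, dim=-1)  # normalize over the selected experts only
print(expert_ids, weights)            # which experts run, and their mixing weights
```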
|
|
|
## Pruning Methodology |
|
|
|
### What is Expert Pruning? |
|
Mixture-of-Experts models contain multiple specialized sub-networks (experts) per layer. During inference, only a subset of experts is activated for each token. Expert pruning involves the following (sketched in code after the list):
|
|
|
1. **Analyzing Usage Patterns**: Tracking which experts activate most frequently for specific tasks |
|
2. **Removing Underutilized Experts**: Discarding experts with low activation rates for the target domain |
|
3. **Preserving Router Functionality**: Maintaining the routing mechanism with fewer available experts |
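
The gist of these steps can be sketched in a few lines of PyTorch. This is a simplified illustration under assumed shapes, not the exact pipeline used for this model; the stacked expert tensor, router matrix, and `activation_counts` are stand-ins for the real GPT-OSS internals:

```python
import torch

def prune_experts(expert_weights, router_weight, activation_counts, keep):
    """Keep the `keep` most frequently routed-to experts; shrink the router to match."""
    # Steps 1-2: rank experts by how often the router selected them, keep the top `keep`
    keep_idx = torch.argsort(activation_counts, descending=True)[:keep]
    keep_idx, _ = torch.sort(keep_idx)  # preserve the original expert ordering
    # Step 3: drop underutilized experts and the matching router rows, so the
    # router now emits logits only over the surviving experts
    return expert_weights[keep_idx], router_weight[keep_idx]

# Toy shapes: 32 experts, pruned to 25
experts = torch.randn(32, 1024, 512)    # (num_experts, d_ff, d_model)
router = torch.randn(32, 512)           # (num_experts, d_model)
counts = torch.randint(0, 1000, (32,))  # per-expert activation counts
pruned_experts, pruned_router = prune_experts(experts, router, counts, keep=25)
print(pruned_experts.shape, pruned_router.shape)  # (25, 1024, 512), (25, 512)
```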
|
|
|
### Our Approach |
|
- **Data-Driven Selection**: Used activation patterns from math evaluation tasks |
|
- **Systematic Reduction**: Reduced from 32 to 25 experts per layer |
|
- **No Retraining**: Direct removal without additional training steps |
|
|
|
## Performance & Applications |
|
|
|
### Pruning Benefits |
|
- **Smaller Memory Footprint**: 78.1% of original expert parameters |
|
- **Reduced Computational Load**: Fewer candidate experts per routing decision during inference
|
- **Focused Capabilities**: Retains experts relevant to math tasks |
|
|
|
### Use Cases |
|
- **Speculative Decoding**: Draft model for the full GPT-OSS-20B (see the sketch below)
|
- **Resource-Constrained Deployment**: Edge devices, mobile applications |
|
- **Research**: Study expert specialization in MoE models |
|
- **Fine-tuning**: Smaller base model for domain adaptation |
|
|
|
*Note: Performance may vary depending on how well the pruned experts match your specific use case.* |
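
For the speculative-decoding use case, one way to experiment is Hugging Face transformers' assisted generation, where a small draft model proposes tokens that the full model verifies. This is a minimal sketch, assuming both checkpoints share a tokenizer (they do here, since the pruned model inherits it from GPT-OSS-20B); any actual speedup with this draft is untested:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Target: the full model; draft: this pruned variant.
# Note: this loads both models, so it needs memory for the two of them.
target = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "AmanPriyanshu/gpt-oss-16.7b-specialized-math-pruned-moe-only-25-experts",
    torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

inputs = tokenizer("What is 17 * 24?", return_tensors="pt").to(target.device)

# Assisted generation: the draft proposes tokens, the target verifies them
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```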
|
|
|
## Motivation & Expert Selection |
|
|
|
This mathematics-focused model retains the experts that activated most strongly on mathematical reasoning tasks from MMLU mathematics subjects and quantitative sections. These experts are associated with mathematical computation, proof strategies, and logical reasoning.
|
|
|
Expert selection drew on our analysis of router activation patterns across multiple evaluation benchmarks:
|
|
|
- **GPQA**: Graduate-level questions in physics, chemistry, biology (Diamond & Expert subsets) |
|
- **MMLU/MMLU-Pro**: Comprehensive knowledge across 57+ subjects including science, medicine, law |
|
- **SORRY-Bench**: Safety evaluation across harmful content categories |
|
- **Tulu3**: Persona-driven instruction following with verifiable constraints |
|
- **Polyglot-or-Not**: Multilingual factual completion tasks |
|
|
|
By identifying experts that consistently activated on math tasks, we created a specialized model that aims to retain domain expertise while reducing the expert count from 32 to 25 per layer.
|
|
|
## Dataset & Analysis Foundation |
|
|
|
This model is based on analysis from the **GPT-OSS-20B MoE Expert Activations dataset** available at: |
|
🔗 **https://huggingface.co/datasets/AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations** |
|
|
|
The dataset contains router activation patterns from OpenAI's GPT-OSS-20B model across diverse evaluation benchmarks, enabling the creation of these domain-optimized models through systematic expert pruning. |
|
|
|
### Pruning Methodology |
|
Our approach involves the following steps (a simplified sketch follows the list):
|
1. **Activation Analysis**: Comprehensive evaluation of expert usage patterns across domain-specific tasks |
|
2. **Expert Ranking**: Identification of the most frequently activated experts for target domains |
|
3. **Systematic Pruning**: Reduction from 32 to 25 experts while preserving router functionality |
|
4. **Quality Validation**: Testing to ensure maintained performance on target tasks |
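
A minimal sketch of steps 1-2 against the activations dataset appears below. The `layer` and `expert` column names are assumptions made for illustration - check the actual dataset schema before running:

```python
from collections import Counter
from datasets import load_dataset

# Hypothetical schema: one row per routing event, with `layer` and `expert` fields
ds = load_dataset("AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations", split="train")

# Steps 1-2: count how often each (layer, expert) pair was routed to
counts = Counter((row["layer"], row["expert"]) for row in ds)

# Step 3 input: the 25 most frequently activated expert ids in each layer
num_layers, keep = 24, 25
kept = {
    layer: [e for (l, e), _ in counts.most_common() if l == layer][:keep]
    for layer in range(num_layers)
}
print(kept[0])  # expert ids retained in layer 0
```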
|
|
|
*This is a direct pruning approach - no additional training was performed. The retained weights are inherited unchanged from the original GPT-OSS-20B, so the model keeps whatever capabilities survive the focused expert selection.*
|
|
|
## Usage |
|
|
|
### CPU Inference |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
|
|
# Load the specialized model on CPU |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"AmanPriyanshu/gpt-oss-16.7b-specialized-math-pruned-moe-only-25-experts", |
|
torch_dtype=torch.bfloat16, |
|
device_map="cpu", |
|
trust_remote_code=True |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("AmanPriyanshu/gpt-oss-16.7b-specialized-math-pruned-moe-only-25-experts") |
|
|
|
# Generate with the model |
|
messages = [ |
|
{"role": "user", "content": "Solve this equation: 2x + 5 = 17. Show your work step by step."} |
|
] |
|
|
|
inputs = tokenizer.apply_chat_template( |
|
messages, |
|
add_generation_prompt=True, |
|
return_tensors="pt", |
|
return_dict=True, |
|
reasoning_effort="medium" |
|
) |
|
|
|
# Ensure inputs are on the same device as model |
|
inputs = {k: v.to(model.device) for k, v in inputs.items()} |
|
|
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=512, |
|
do_sample=True, |
|
temperature=0.1, |
|
top_p=0.9, |
|
pad_token_id=tokenizer.eos_token_id, |
|
eos_token_id=tokenizer.eos_token_id |
|
) |
|
|
|
# Decode only the generated part |
|
input_length = inputs['input_ids'].shape[1] |
|
response_tokens = outputs[0][input_length:] |
|
response = tokenizer.decode(response_tokens, skip_special_tokens=True) |
|
print(response) |
|
``` |
|
|
|
### Apple Silicon (MPS) Inference |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
|
|
# Check MPS availability and load model |
|
device = "mps" if torch.backends.mps.is_available() else "cpu" |
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
"AmanPriyanshu/gpt-oss-16.7b-specialized-math-pruned-moe-only-25-experts", |
|
torch_dtype=torch.float16, # Better MPS compatibility |
|
device_map=device, |
|
trust_remote_code=True, |
|
low_cpu_mem_usage=True |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("AmanPriyanshu/gpt-oss-16.7b-specialized-math-pruned-moe-only-25-experts") |
|
|
|
# Generate with the model |
|
messages = [ |
|
{"role": "user", "content": "Solve this equation: 2x + 5 = 17. Show your work step by step."} |
|
] |
|
|
|
inputs = tokenizer.apply_chat_template( |
|
messages, |
|
add_generation_prompt=True, |
|
return_tensors="pt", |
|
return_dict=True, |
|
reasoning_effort="medium" |
|
) |
|
|
|
# Move inputs to model device |
|
inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()} |
|
|
|
# Use torch.no_grad for MPS stability |
|
with torch.no_grad(): |
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=512, |
|
do_sample=True, |
|
temperature=0.1, |
|
top_p=0.9, |
|
pad_token_id=tokenizer.eos_token_id, |
|
eos_token_id=tokenizer.eos_token_id, |
|
use_cache=True |
|
) |
|
|
|
# Decode only the generated part |
|
input_length = inputs['input_ids'].shape[1] |
|
response_tokens = outputs[0][input_length:] |
|
response = tokenizer.decode(response_tokens, skip_special_tokens=True) |
|
print(response) |
|
``` |
|
|
|
### GPU Inference |
|
|
|
For CUDA GPU inference, the model works with minimal modifications to the CPU example above. Simply change:
|
```python |
|
device_map="auto" # Will automatically use GPU if available |
|
torch_dtype=torch.bfloat16 # or torch.float16 |
|
``` |
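
Put together, a full GPU load with those two changes might look like this (generation then proceeds exactly as in the CPU example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "AmanPriyanshu/gpt-oss-16.7b-specialized-math-pruned-moe-only-25-experts",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",           # automatically places weights on GPU if available
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("AmanPriyanshu/gpt-oss-16.7b-specialized-math-pruned-moe-only-25-experts")
```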
|
|
|
## Model Examples |
|
|
|
The following examples demonstrate the model's performance on math tasks: |
|
|
|
### Example 1 |
|
**Question:** Solve this equation: 2x + 5 = 17. Show your work step by step. |
|
|
|
**Response:** |
|
``` |
|
<|channel|>analysis<|message|>Just solve linear equation.<|end|><|start|>assistant<|channel|>final<|message|>**Equation to solve** |
|
|
|
\[ |
|
2x + 5 = 17 |
|
\] |
|
|
|
--- |
|
|
|
### Step 1: Isolate the term containing \(x\) |
|
|
|
Subtract \(5\) from both sides of the equation to get the \(x\)-term alone on the left: |
|
|
|
\[ |
|
2x + 5 - 5 = 17 - 5 |
|
\] |
|
|
|
\[ |
|
2x = 12 |
|
\] |
|
|
|
--- |
|
|
|
### Step 2: Solve for \(x\) |
|
|
|
Divide both sides by the coefficient of \(x\), which is \(2\): |
|
|
|
\[ |
|
\frac{2x}{2} = \frac{12}{2} |
|
\] |
|
|
|
\[ |
|
x = 6 |
|
\] |
|
|
|
--- |
|
|
|
### Check the solution |
|
|
|
Plug \(x = 6\) back into the original equation: |
|
|
|
\[ |
|
2(6) + 5 = 12 + 5 = 17 |
|
\] |
|
|
|
The left‑hand side equals the right‑hand side, so the solution is correct. |
|
|
|
--- |
|
|
|
\[ |
|
\boxed{x = 6} |
|
\]<|return|> |
|
``` |
|
|
|
--- |
|
|
|
### Example 2 |
|
**Question:** What is the Pythagorean theorem and how is it used? |
|
|
|
**Response:** |
|
``` |
|
<|channel|>analysis<|message|>Need explain theorem and usage.<|end|><|start|>assistant<|channel|>final<|message|>**The Pythagorean Theorem** |
|
|
|
> In a right‑angled triangle, the square of the length of the hypotenuse (the side opposite the right angle) equals the sum of the squares of the lengths of the other two sides. |
|
|
|
Mathematically: |
|
|
|
\[ |
|
c^{2}=a^{2}+b^{2} |
|
\] |
|
|
|
where |
|
- \(c\) = length of the hypotenuse, |
|
- \(a\) and \(b\) = lengths of the two legs. |
|
|
|
--- |
|
|
|
### How It Is Used |
|
|
|
| Context | What you do | Example | |
|
|---------|-------------|---------| |
|
| **Geometry** | Find a missing side of a right triangle | If \(a=3\) m, \(b=4\) m, then \(c=\sqrt{3^{2}+4^{2}}=\sqrt{25}=5\) m. | |
|
| **Engineering** | Design right‑angled structures (e.g., ladders, roofs) | Determine the minimum ladder length needed to reach a height of 10 m on a 6 m wall: \(c=\sqrt{10^{2}+6^{2}}\approx 11.66\) m. | |
|
| **Navigation & Surveying** | Compute straight‑line distances (great‑circle distance approximations) | Use the theorem on a planar map to estimate the shortest path between two points. | |
|
| **Computer Graphics** | Calculate pixel distances, collision detection | Distance between two points \((x_1,y_1)\) and \((x_2,y_2)\): \(\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}\). | |
|
| **Physics** | Resolve vectors into perpendicular components | If a force has components \(F_x\) and \(F_y\), its magnitude is \(\sqrt{F_x^2+F_y^2}\). | |
|
| **Trigonometry** | Derive relationships between sine, cosine, and tangent | From \(c^2=a^2+b^2\) and \(a=c\cos\theta\), \(b=c\sin\theta\), we get \(\sin^2\theta+\cos^2\theta=1\). | |
|
| **Computer Science** | Euclidean distance in higher‑dimensional spaces | For vectors \(\mathbf{u}\) and \ |
|
``` |
|
|
|
--- |
|
|
|
### Example 3 |
|
**Question:** Calculate the area of a circle with radius 7 meters. |
|
|
|
**Response:** |
|
``` |
|
<|channel|>analysis<|message|>Compute area πr^2 = π*49 ≈ 153.938.<|end|><|start|>assistant<|channel|>final<|message|>The area \(A\) of a circle is given by |
|
|
|
\[ |
|
A = \pi r^{2} |
|
\] |
|
|
|
where \(r\) is the radius. |
|
|
|
For a radius of \(r = 7\) m: |
|
|
|
\[ |
|
A = \pi (7\,\text{m})^{2} |
|
= \pi \times 49\,\text{m}^{2} |
|
= 49\pi \,\text{m}^{2} |
|
\] |
|
|
|
Using \(\pi \approx 3.14159\): |
|
|
|
\[ |
|
A \approx 49 \times 3.14159 \;\text{m}^{2} |
|
\approx 153.938 \;\text{m}^{2} |
|
\] |
|
|
|
So the area of the circle is **\(49\pi\) square meters**, which is approximately **\(154\) m²** (rounded to the nearest whole number).<|return|> |
|
``` |
|
|
|
--- |
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex |
|
@misc{priyanshu2025gptoss, |
|
title={{GPT-OSS MoE Expert Fingerprinting: Analyzing Expert Activation Patterns in Mixture of Experts Models}}, |
|
author={Priyanshu, Aman and Vijay, Supriti}, |
|
year={2025}, |
|
howpublished={\url{https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/}}, |
|
note={Interactive analysis tool for expert activation patterns in MoE architectures} |
|
} |
|
``` |
|
|
|
## References & Resources |
|
|
|
- **Original Model**: [OpenAI GPT-OSS Model Card](https://openai.com/index/introducing-gpt-oss/) |
|
- **Model Hub**: [GPT-OSS-20B on Hugging Face](https://huggingface.co/openai/gpt-oss-20b) |
|
- **Expert Analysis Dataset**: [GPT-OSS-20B MoE Expert Activations](https://huggingface.co/datasets/AmanPriyanshu/GPT-OSS-20B-MoE-expert-activations) |
|
- **Project Page**: [GPT-OSS MoE Expert Fingerprinting](https://amanpriyanshu.github.io/GPT-OSS-MoE-ExpertFingerprinting/) |
|
- **GitHub Repository**: [OpenAI GPT-OSS](https://github.com/openai/gpt-oss) |
|
|