# gpt-oss-120b (Clean Fork for Stable Deployment)
This repository, `mshojaei77/gpt-oss-120b`, is a clean, deployment-focused fork of the official `openai/gpt-oss-120b` model.
## Why Does This Fork Exist?
The original `openai/gpt-oss-120b` repository contains large, non-essential folders (`/original` and `/metal`) that can cause issues with automated downloaders such as the one used by vLLM. These extra folders can lead to "Disk quota exceeded" errors during deployment, even on systems with sufficient disk space for the core model.
This fork solves that problem by containing only the essential files required for inference.
By using this repository, you download only the ~65 GB of model weights and configuration files needed for inference, ensuring a smooth and reliable deployment.
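If you want to pre-fetch the weights before starting a server, the sketch below uses `huggingface_hub` to download this repository into a cache directory. The `/workspace/hf-cache` path is an assumption that matches the RunPod setup described later; adjust it to your own storage layout.

```python
# Optional pre-download sketch (assumes the huggingface_hub package is installed).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="mshojaei77/gpt-oss-120b",
    cache_dir="/workspace/hf-cache",  # keep the ~65 GB on the large persistent volume
)
print(f"Model files cached at: {local_path}")
```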
## Model Quick Facts
- **Original Model:** `openai/gpt-oss-120b`
- **Parameters:** 117B total (5.1B active per forward pass)
- **Architecture:** Mixture of Experts (MoE)
- **Quantization:** Pre-quantized with MXFP4 for the MoE layers
- **License:** Apache-2.0
- **Format:** The model was trained for the Harmony response format. vLLM's chat template handles this automatically.
## 🚀 Production-Grade Deployment with vLLM on RunPod
This is a battle-tested guide for deploying this model on a single NVIDIA H100 (80GB) GPU using vLLM and RunPod.
### Step 1: Configure Your RunPod Pod
A correct disk configuration is the most critical step.
- **GPU:** Select 1 x H100 80GB.
- **Template:** Use a standard PyTorch image (e.g., `runpod/pytorch`).
- **Disks (Important!):**
  - **Container Disk:** 30 GB (this is temporary).
  - **Volume Disk:** 90 GB or more. This is your persistent storage.
  - **Volume Mount Path:** Set to `/workspace`.
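Once the pod is running, it is worth confirming that `/workspace` really is the large persistent volume before pulling ~65 GB of weights. A minimal sketch using only the Python standard library (the 80 GB threshold is an illustrative assumption, roughly the model size plus cache overhead):

```python
# Sanity-check that /workspace points at the large volume, not the small container disk.
import shutil

total, used, free = shutil.disk_usage("/workspace")
free_gb = free / 1024**3
print(f"/workspace free space: {free_gb:.1f} GB")

# Assumed threshold: ~65 GB of weights plus room for cache and temp files.
if free_gb < 80:
    raise SystemExit("Volume looks too small; check the Volume Mount Path setting.")
```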
### Step 2: Set Up the Environment
Connect to your pod and run the following commands to install dependencies in a persistent virtual environment.
```bash
# Install uv, a fast package manager
pip install uv

# Create and activate a virtual environment inside our persistent /workspace
uv venv --python 3.12 --seed /workspace/.venv
source /workspace/.venv/bin/activate

# Install the specialized vLLM build for gpt-oss
uv pip install --pre "vllm==0.10.1+gptoss" \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```
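Before launching anything, a quick sanity check that the gpt-oss build of vLLM imported cleanly can save a failed server start later. A minimal sketch, run inside the activated virtual environment:

```python
# Confirm the vLLM install in /workspace/.venv is importable and report its version.
import vllm

print("vLLM version:", vllm.__version__)  # expect the 0.10.1+gptoss build
```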
### Step 3: Launch the vLLM Server
This command will download the model from this repository and start an OpenAI-compatible API server.
First, configure your shell session:
```bash
# Create cache directories inside the persistent volume
mkdir -p /workspace/hf-cache /workspace/tmp

# Point all caching and temp operations to the 90GB volume
export HF_HOME="/workspace/hf-cache"
export TMPDIR="/workspace/tmp"

# Force vLLM to use the highly optimized FlashAttention-3 kernel
export VLLM_FLASH_ATTN_VERSION=3
```
Now, launch the server. The first launch will download the ~65GB model.
```bash
vllm serve mshojaei77/gpt-oss-120b \
    --trust-remote-code \
    --dtype bfloat16 \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --max-num-seqs 16 \
    --download-dir /workspace/hf-cache
```
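Once the server logs show it is ready, you can confirm the OpenAI-compatible endpoint is serving this model from a second terminal. A minimal sketch using the `openai` client against the local port (the empty API key matches the default unauthenticated setup):

```python
# List the models exposed by the local vLLM server to confirm it is up.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)  # expect "mshojaei77/gpt-oss-120b"
```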
**Why these flags?**

- `--gpu-memory-utilization 0.90`: Safely uses 90% of the H100's VRAM.
- `--max-model-len 32768`: Enables the full 32k context window.
- `--download-dir /workspace/hf-cache`: Crucial flag. Forces vLLM to use your persistent volume, avoiding bugs where it might default to the small container disk.
⚠️ **CAUTION:** Do not use `--kv-cache-dtype fp8` with this setup on an H100/H200. There is a known kernel incompatibility in this vLLM build that can cause a runtime error. The H100 has sufficient VRAM to handle the 32k context in full `bfloat16` precision.
### Step 4: Use the API
Once the server is running, you can connect to it using any OpenAI-compatible client. If using RunPod's public proxy, find your URL in the pod's "Ports" section.
```python
from openai import OpenAI

# Replace with your RunPod proxy URL, or http://localhost:8000/v1 if testing internally
client = OpenAI(
    base_url="<YOUR_RUNPOD_PROXY_URL>/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="mshojaei77/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."},
    ],
)

print(response.choices[0].message.content)
```
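For long generations you may prefer to stream tokens as they arrive rather than waiting for the full reply. A minimal streaming variant of the same call, reusing the `client` configured above:

```python
# Stream the completion chunk by chunk instead of waiting for the full response.
stream = client.chat.completions.create(
    model="mshojaei77/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the Mixture of Experts architecture."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```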
## Original Model
This model is a fork of openai/gpt-oss-120b. Please refer to the original model card for all details regarding its architecture, training, and intended use.
## License
This model is licensed under the Apache-2.0 License, consistent with the original repository.