gpt-oss-120b (Clean Fork for Stable Deployment)

This repository, mshojaei77/gpt-oss-120b, is a clean, deployment-focused fork of the official openai/gpt-oss-120b model.

Why Does This Fork Exist?

The original openai/gpt-oss-120b repository contains large, non-essential folders (/original and /metal). Automated downloaders, such as the one vLLM uses, pull these folders down as well, which can trigger "Disk quota exceeded" errors during deployment even on systems with enough disk space for the core model.

This fork solves that problem by containing only the essential files required for inference.

By using this repository, you are guaranteed to download only the ~65 GB of necessary model weights and configuration files, ensuring a smooth and reliable deployment.
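If you prefer to pre-fetch the weights yourself instead of letting vLLM download them, the same idea can be expressed with huggingface_hub. A minimal sketch (the ignore_patterns line only matters when pointing at the upstream repo, which still carries the extra folders; against this fork it is a no-op):

from huggingface_hub import snapshot_download

# Download only the files needed for inference into the persistent volume.
snapshot_download(
    repo_id="mshojaei77/gpt-oss-120b",
    local_dir="/workspace/hf-cache/gpt-oss-120b",
    ignore_patterns=["original/*", "metal/*"],  # excludes the extra folders upstream; no-op for this fork
)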


Model Quick Facts

  • Original Model: openai/gpt-oss-120b
  • Parameters: 117B (5.1B active per forward pass)
  • Architecture: Mixture of Experts (MoE)
  • Quantization: Pre-quantized with MXFP4 for MoE layers.
  • License: Apache-2.0
  • Format: The model was trained for the Harmony response format; vLLM's chat template handles this automatically (see the sketch after this list).
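You normally never build Harmony-formatted prompts by hand. A minimal sketch with the transformers tokenizer, assuming this fork inherits the upstream chat template, shows what the template produces:

from transformers import AutoTokenizer

# The chat template converts plain chat messages into the Harmony response
# format the model was trained on.
tokenizer = AutoTokenizer.from_pretrained("mshojaei77/gpt-oss-120b")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # inspect the Harmony-formatted prompt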

🚀 Production-Grade Deployment with vLLM on RunPod

This is a battle-tested guide for deploying this model on a single NVIDIA H100 (80GB) GPU using vLLM and RunPod.

Step 1: Configure Your RunPod Pod

A correct disk configuration is the most critical step; a quick sanity check is sketched after the list below.

  1. GPU: Select 1 x H100 80GB.
  2. Template: Use a standard PyTorch image (e.g., runpod/pytorch).
  3. Disks (Important!):
    • Container Disk: 30 GB (temporary; its contents are not persisted).
    • Volume Disk: 90 GB or more. This is your persistent storage.
    • Volume Mount Path: Set to /workspace.
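Once the pod is up, it is worth confirming that the large persistent volume, not the 30 GB container disk, is what is mounted at /workspace. A minimal check in Python, assuming the configuration above:

import shutil

# The model download alone needs roughly 65 GB, so the volume mounted at
# /workspace should report at least ~90 GB total.
total, used, free = shutil.disk_usage("/workspace")
print(f"/workspace: {total / 1e9:.0f} GB total, {free / 1e9:.0f} GB free")
assert free > 70e9, "Not enough free space on the persistent volume for the model"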

Step 2: Set Up the Environment

Connect to your pod and run the following commands to install dependencies in a persistent virtual environment.

# Install uv, a fast package manager
pip install uv

# Create and activate a virtual environment inside our persistent /workspace
uv venv --python 3.12 --seed /workspace/.venv
source /workspace/.venv/bin/activate

# Install the specialized vLLM build for gpt-oss
uv pip install --pre "vllm==0.10.1+gptoss" \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match
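After the install completes, a quick sanity check confirms that the gpt-oss build of vLLM is the one on your path (the expected version string follows from the pin above):

# Run inside the activated /workspace/.venv environment.
import vllm

print(vllm.__version__)  # expect something like "0.10.1+gptoss"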

Step 3: Launch the vLLM Server

This command will download the model from this repository and start an OpenAI-compatible API server.

First, configure your shell session:

# Create cache directories inside the persistent volume
mkdir -p /workspace/hf-cache /workspace/tmp

# Point all caching and temp operations to the 90GB volume
export HF_HOME="/workspace/hf-cache"
export TMPDIR="/workspace/tmp"

# Force vLLM to use the highly optimized FlashAttention-3 kernel
export VLLM_FLASH_ATTN_VERSION=3

Now, launch the server. The first launch will download the ~65GB model.

vllm serve mshojaei77/gpt-oss-120b \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 16 \
  --download-dir /workspace/hf-cache

Why these flags?

  • --gpu-memory-utilization 0.90: Lets vLLM allocate up to 90% of the H100's 80 GB of VRAM, leaving headroom for CUDA and runtime overhead.
  • --max-model-len 32768: Caps the context window at 32k tokens, which fits comfortably in VRAM alongside the weights on a single H100.
  • --download-dir /workspace/hf-cache: Crucial flag. Forces vLLM to use your persistent volume, avoiding bugs where it might default to the small container disk.

⚠️ CAUTION: Do not use --kv-cache-dtype fp8 with this setup on an H100/H200. There is a known kernel incompatibility in this vLLM build that can cause a runtime error. The H100 has sufficient VRAM to handle the 32k context in full bfloat16 precision.
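Once the launch logs report that the server is ready, you can confirm it is serving the model before wiring up a client. A minimal sketch against the OpenAI-compatible endpoint, assuming the server is reachable locally on port 8000:

import requests

# The OpenAI-compatible server lists the models it serves at /v1/models.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should include "mshojaei77/gpt-oss-120b"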

Step 4: Use the API

Once the server is running, you can connect to it using any OpenAI-compatible client. If using RunPod's public proxy, find your URL in the pod's "Ports" section.

from openai import OpenAI

# Replace with your RunPod proxy URL or http://localhost:8000/v1 if testing internally
client = OpenAI(
    base_url="<YOUR_RUNPOD_PROXY_URL>/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="mshojaei77/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(response.choices[0].message.content)
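For longer generations you may prefer streamed output. A minimal sketch reusing the client from the snippet above:

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="mshojaei77/gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize MXFP4 quantization in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()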

Original Model

This model is a fork of openai/gpt-oss-120b. Please refer to the original model card for all details regarding its architecture, training, and intended use.

License

This model is licensed under the Apache-2.0 License, consistent with the original repository.
