🚀 Best Practices for Evaluating GPT-OSS Models: Speed & Benchmark Testing Guide
On August 6, 2025, OpenAI released two open-weight models:
- gpt-oss-120b: suited to production environments, general-purpose tasks, and scenarios requiring strong reasoning; it runs on a single H100 GPU (117B parameters, of which 5.1B are active per token).
- gpt-oss-20b: suited to low-latency, local, or specialized use cases (21B parameters, of which 3.6B are active per token).
Let’s use the EvalScope model evaluation framework to quickly test the inference speed and benchmark performance of these models.
Environment Setup
To make model deployment easier and improve inference speed, we use vLLM to launch a web service compatible with the OpenAI API format.
⚠️ Note: As of August 6, 2025, vLLM 0.10.1, which adds support for the gpt-oss models, has not been officially released. You need to install pre-release builds of vLLM and the gpt-oss dependencies, so it is recommended to create a fresh Python 3.12 environment to avoid disturbing your existing setup.
- Create and activate a new conda environment:
conda create -n gpt_oss_vllm python=3.12
conda activate gpt_oss_vllm
- Install the necessary dependencies:
# Install PyTorch-nightly and vLLM
pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128
# Install FlashInfer
pip install flashinfer-python==0.2.10
# Install evalscope
pip install 'evalscope[perf]' -U
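Before starting the service, you can optionally confirm that the pre-release packages landed in the new environment. A minimal sanity-check script, assuming the packages were installed exactly as shown above:

# Optional sanity check: confirm the pre-release vLLM build, PyTorch nightly, and FlashInfer
# are installed in the active environment (distribution names as used in the pip commands above).
from importlib.metadata import version

import torch
import vllm

print("vllm:", vllm.__version__)                       # expected to contain 0.10.1+gptoss
print("torch:", torch.__version__)                     # nightly cu128 build
print("flashinfer-python:", version("flashinfer-python"))
print("CUDA available:", torch.cuda.is_available())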
- Start the model service
The following commands launch the gpt-oss-20b model service; we ran them successfully on an H20 GPU.
To download the model via ModelScope (recommended):
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 VLLM_USE_MODELSCOPE=true vllm serve openai-mirror/gpt-oss-20b --served-model-name gpt-oss-20b --trust_remote_code --port 8801
To download the model via HuggingFace:
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-20b --served-model-name gpt-oss-20b --trust_remote_code --port 8801
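Once the service is up, a quick request against the OpenAI-compatible endpoint confirms that it is responding. A minimal sketch using the openai Python client (assumed to be installed separately; the API key is a placeholder, since the service started above does not require one):

# Send a single chat completion to the local vLLM service started above.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8801/v1", api_key="EMPTY")  # placeholder key

resp = client.chat.completions.create(
    model="gpt-oss-20b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)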
Inference Speed Test
We use EvalScope's perf (stress-testing) feature to measure the model's inference speed.
Test environment:
- GPU: H20-96GB * 1
- vLLM version: 0.10.1+gptoss
- Prompt length: 1024 tokens
- Output length: 1024 tokens
Run the test script:
evalscope perf \
--parallel 1 10 50 100 \
--number 5 20 100 200 \
--model gpt-oss-20b \
--url http://127.0.0.1:8801/v1/completions \
--api openai \
--dataset random \
--max-tokens 1024 \
--min-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--log-every-n-query 20 \
--tokenizer-path openai-mirror/gpt-oss-20b \
--extra-args '{"ignore_eos": true}'
Output: a summary report of throughput and latency metrics at each tested concurrency level.
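The same speed test can also be driven from Python instead of the CLI. A sketch of the equivalent configuration, assuming EvalScope's Arguments and run_perf_benchmark entry points from the evalscope.perf module (parameter names mirror the CLI flags above):

# Python equivalent of the `evalscope perf` command above (same parameters).
from evalscope.perf.arguments import Arguments
from evalscope.perf.main import run_perf_benchmark

perf_cfg = Arguments(
    parallel=[1, 10, 50, 100],          # concurrency levels to sweep
    number=[5, 20, 100, 200],           # requests issued at each concurrency level
    model='gpt-oss-20b',
    url='http://127.0.0.1:8801/v1/completions',
    api='openai',
    dataset='random',                   # randomly generated prompts
    max_tokens=1024,
    min_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,
    max_prompt_length=1024,
    log_every_n_query=20,
    tokenizer_path='openai-mirror/gpt-oss-20b',
    extra_args={'ignore_eos': True},    # force fixed-length outputs
)

run_perf_benchmark(perf_cfg)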
Benchmark Evaluation
We use EvalScope's benchmark evaluation feature to assess the model's capabilities, taking the AIME2025 mathematical reasoning benchmark as an example.
Run the test script:
from evalscope.constants import EvalType
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='gpt-oss-20b',                    # Model name
    api_url='http://127.0.0.1:8801/v1',     # Model service address
    eval_type=EvalType.SERVICE,             # Evaluation type: evaluate a running API service
    datasets=['aime25'],                    # Dataset to test
    generation_config={
        'extra_body': {'reasoning_effort': 'high'}  # Generation parameters: high reasoning effort
    },
    eval_batch_size=10,                     # Concurrent batch size
    timeout=60000,                          # Request timeout in seconds
)

run_task(task_cfg=task_cfg)
Sample output:
The overall score here is 0.8. Scores can vary from run to run, so you can try different generation parameters and repeat the test several times (see the sketch after the table below).
+-------------+-----------+---------------+-------------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=============+===========+===============+=============+=======+=========+=========+
| gpt-oss-20b | aime25 | AveragePass@1 | AIME2025-I | 15 | 0.8 | default |
+-------------+-----------+---------------+-------------+-------+---------+---------+
| gpt-oss-20b | aime25 | AveragePass@1 | AIME2025-II | 15 | 0.8 | default |
+-------------+-----------+---------------+-------------+-------+---------+---------+
| gpt-oss-20b | aime25 | AveragePass@1 | OVERALL | 30 | 0.8 | - |
+-------------+-----------+---------------+-------------+-------+---------+---------+
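As noted above, scores can shift with different generation settings. A minimal sketch that reruns the same evaluation at several reasoning_effort levels, reusing the TaskConfig fields from the script above:

# Rerun the AIME2025 evaluation at different reasoning-effort levels and compare scores.
from evalscope.constants import EvalType
from evalscope import TaskConfig, run_task

for effort in ['low', 'medium', 'high']:
    task_cfg = TaskConfig(
        model='gpt-oss-20b',
        api_url='http://127.0.0.1:8801/v1',
        eval_type=EvalType.SERVICE,
        datasets=['aime25'],
        generation_config={'extra_body': {'reasoning_effort': effort}},
        eval_batch_size=10,
        timeout=60000,
    )
    run_task(task_cfg=task_cfg)  # each run prints its own report table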
For more supported benchmarks, please refer to the EvalScope documentation.
Result Visualization
EvalScope can also visualize evaluation results, so you can inspect the model's individual outputs:
pip install 'evalscope[app]'
evalscope app --lang en
Summary
Through the steps above, we have used EvalScope to measure both the inference speed and the benchmark performance of the gpt-oss model. It performs strongly on both fronts, making it suitable for production and high-performance scenarios.
It does perform well, from 0 to Refusal in just 3.2 seconds.