🚀 Best Practices for Evaluating GPT-OSS Models: Speed & Benchmark Testing Guide


On August 6, 2025, OpenAI released two open-weight models:

  • gpt-oss-120b — suited to production environments, general-purpose tasks, and scenarios that require strong reasoning. It runs on a single H100 GPU (117B total parameters, 5.1B active per token).
  • gpt-oss-20b — suited to low-latency, local, or specialized scenarios (21B total parameters, 3.6B active per token).

Let’s use the EvalScope model evaluation framework to quickly test the inference speed and benchmark performance of these models.

Environment Setup

To make model deployment easier and improve inference speed, we use vLLM to launch a web service compatible with the OpenAI API format.

⚠️ Note: As of August 6, 2025, vLLM 0.10.1, which supports the gpt-oss models, has not been officially released. You need to install a pre-release build of vLLM along with the gpt-oss dependencies. We recommend creating a fresh Python 3.12 environment so your existing setup is not affected.

  1. Create and activate a new conda environment:
conda create -n gpt_oss_vllm python=3.12
conda activate gpt_oss_vllm
  2. Install the necessary dependencies:
# Install PyTorch-nightly and vLLM
pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128
# Install FlashInfer
pip install flashinfer-python==0.2.10
# Install evalscope
pip install evalscope[perf] -U
  3. Start the model service:

The commands below start the gpt-oss-20b model service; we ran them successfully on a single H20 GPU. A quick sanity check of the running service follows after the start commands.

To download the model via ModelScope (recommended):

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 VLLM_USE_MODELSCOPE=true vllm serve openai-mirror/gpt-oss-20b --served-model-name gpt-oss-20b --trust_remote_code --port 8801

To download the model via HuggingFace:

VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve openai/gpt-oss-20b --served-model-name gpt-oss-20b --trust_remote_code --port 8801
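Before moving on, it helps to confirm the service actually answers requests. The snippet below is not part of the original post; it is a minimal sanity check using the openai Python client, assuming the service is listening on port 8801 as configured above.

from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder value works.
client = OpenAI(base_url="http://127.0.0.1:8801/v1", api_key="EMPTY")

# The served model name ("gpt-oss-20b") should appear in this list.
print([m.id for m in client.models.list().data])

# One short chat request confirms end-to-end generation works.
resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)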

Inference Speed Test

We use EvalScope’s performance testing (perf) feature to measure the model’s inference speed.

Test environment:

  • GPU: H20-96GB * 1
  • vLLM version: 0.10.1+gptoss
  • Prompt length: 1024 tokens
  • Output length: 1024 tokens

Run the test script:

evalscope perf \
  --parallel 1 10 50 100 \
  --number 5 20 100 200 \
  --model gpt-oss-20b \
  --url http://127.0.0.1:8801/v1/completions \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --log-every-n-query 20 \
  --tokenizer-path openai-mirror/gpt-oss-20b \
  --extra-args '{"ignore_eos": true}'
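If you want a rough sanity check before running the full sweep, you can time a single 1024-token completion yourself. This sketch is not part of EvalScope; it assumes the openai Python client and the same ignore_eos extension that the --extra-args option above passes to vLLM, and the hard-coded prompt is only a stand-in for the random 1024-token prompts EvalScope generates.

import time
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8801/v1", api_key="EMPTY")

# Rough stand-in for a ~1024-token prompt; EvalScope builds random prompts of exact length.
prompt = "Explain the Fourier transform in detail. " * 120

start = time.perf_counter()
resp = client.completions.create(
    model="gpt-oss-20b",
    prompt=prompt,
    max_tokens=1024,
    # ignore_eos is a vLLM-specific extension (same as --extra-args above),
    # so the model always generates the full 1024 output tokens.
    extra_body={"ignore_eos": True},
)
elapsed = time.perf_counter() - start

out_tokens = resp.usage.completion_tokens
print(f"{out_tokens} output tokens in {elapsed:.1f}s -> {out_tokens / elapsed:.1f} tokens/s")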

Output:

[Inference speed test results shown as an image in the original post]

Benchmark Evaluation

We use EvalScope’s benchmark evaluation feature to assess the model’s capabilities, taking the AIME2025 mathematical reasoning benchmark as an example.

Run the test script:

from evalscope.constants import EvalType
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='gpt-oss-20b',  # Model name
    api_url='http://127.0.0.1:8801/v1',  # Model service address
    eval_type=EvalType.SERVICE, # Evaluation type, here using service evaluation
    datasets=['aime25'],  # Dataset to test
    generation_config={
        'extra_body': {"reasoning_effort": "high"}  # Model generation parameters, set to high reasoning level
    },
    eval_batch_size=10, # Concurrent batch size
    timeout=60000, # Timeout in seconds
)

run_task(task_cfg=task_cfg)
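To spot-check how reasoning_effort affects an individual answer before (or after) running the full benchmark, you can pass the same parameter directly through the OpenAI client's extra_body. This is a sketch, not part of the original post; the question below is an illustrative competition-style problem, not taken from AIME2025.

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8801/v1", api_key="EMPTY")

# Illustrative question only; replace with any problem you want to inspect.
question = "How many positive integers n <= 100 make n^2 + n divisible by 6?"

resp = client.chat.completions.create(
    model="gpt-oss-20b",
    messages=[{"role": "user", "content": question}],
    # Same parameter that TaskConfig forwards via generation_config['extra_body'].
    extra_body={"reasoning_effort": "high"},
)
print(resp.choices[0].message.content)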

Sample output:

The overall score here is 0.8. Scores can vary between runs, so you can try different generation parameters and repeat the evaluation to compare results.

+-------------+-----------+---------------+-------------+-------+---------+---------+
| Model       | Dataset   | Metric        | Subset      |   Num |   Score | Cat.0   |
+=============+===========+===============+=============+=======+=========+=========+
| gpt-oss-20b | aime25    | AveragePass@1 | AIME2025-I  |    15 |     0.8 | default |
+-------------+-----------+---------------+-------------+-------+---------+---------+
| gpt-oss-20b | aime25    | AveragePass@1 | AIME2025-II |    15 |     0.8 | default |
+-------------+-----------+---------------+-------------+-------+---------+---------+
| gpt-oss-20b | aime25    | AveragePass@1 | OVERALL     |    30 |     0.8 | -       |
+-------------+-----------+---------------+-------------+-------+---------+---------+ 

For more supported benchmarks, please refer to the EvalScope documentation.

Result Visualization

EvalScope supports result visualization, so you can inspect the model’s outputs for each sample.

pip install 'evalscope[app]'
evalscope app --lang en

[Result visualization dashboard screenshot from the original post]

Summary

Through the above steps, we have tested the inference speed and benchmark performance of the gpt-oss-20b model using EvalScope. The model delivers good inference speed on a single GPU and scores 0.8 on AIME2025 at high reasoning effort, making it a reasonable candidate for production and high-performance scenarios.

It does perform well, from 0 to Refusal in just 3.2 seconds.
