Why is the performance worse on the 20B than on the 120B?

#84
by megabob - opened

Hi everyone,

I ran some load tests on gpt-oss-120B and gpt-oss-20B and got some surprising results regarding Time To First Token (TTFT).

I tested both models on a single H100, with the following command:

vllm serve <model> --async-scheduling --max-model-len 100000
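For reference, here is a minimal sketch of the kind of TTFT probe I mean: it fires a batch of concurrent streaming requests at the OpenAI-compatible endpoint that `vllm serve` exposes and times the first generated token. The base URL, model name, prompt, and token budget below are illustrative assumptions, not my exact harness.

```python
# Sketch of a TTFT probe against a vLLM OpenAI-compatible server.
# Assumptions: server on localhost:8000 (vllm serve default) and the
# `openai` Python client (>= 1.0) installed.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ttft_one(model: str) -> float:
    """Send one streaming request; return seconds until the first token arrives."""
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
        max_tokens=64,
        stream=True,
    )
    async for chunk in stream:
        # The first chunk carrying content marks the first generated token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")

async def main(model: str = "openai/gpt-oss-20b", concurrency: int = 100) -> None:
    # Launch all requests at once to mimic the concurrent load pattern.
    ttfts = sorted(await asyncio.gather(*(ttft_one(model) for _ in range(concurrency))))
    print(f"mean TTFT: {sum(ttfts) / len(ttfts):.3f}s  "
          f"p50: {ttfts[len(ttfts) // 2]:.3f}s  "
          f"p99: {ttfts[int(len(ttfts) * 0.99)]:.3f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

With `concurrency=100`, all requests hit the server simultaneously, so the reported TTFT includes queueing and batched prefill time, not just a single request's prefill.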

Here are the results for 100 concurrent requests:

[image: TTFT benchmark results for both models]

The difference isn’t huge, but I was expecting the smaller 20B model to have lower TTFT than the much larger 120B. Instead, it’s slightly higher.

Has anyone seen similar behavior?
Any insights into why the TTFT could be higher for a smaller model in this context?

It could be that you are using incorrect flags for the larger models; vLLM has a detailed guide here: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html

Thanks for your answer, but we are already using the recommended flags from the guide.
