Why is performance worse on the 20B than on the 120B?
#84 · opened by megabob
Hi everyone,
I ran some load tests on gpt-oss-120B and gpt-oss-20B and got some surprising results regarding Time To First Token (TTFT).
I tested both models on a single H100, with the following command:
vllm serve <model> --async-scheduling --max-model-len 100000
Here are the results for 100 concurrent requests:
The difference isn’t huge, but I was expecting the smaller 20B model to have lower TTFT than the much larger 120B. Instead, it’s slightly higher.
Has anyone seen similar behavior?
Any insights into why the TTFT could be higher for a smaller model in this context?
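For anyone who wants to reproduce this, here is a minimal sketch of a TTFT probe against the OpenAI-compatible endpoint (the base URL, model name, and prompt are assumptions, not my exact harness):

```python
# Minimal TTFT probe against a vLLM OpenAI-compatible endpoint.
# Assumptions: server at http://localhost:8000/v1, no auth, model name as served.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def measure_ttft(model: str, prompt: str) -> float:
    """Time from request send to the first streamed chunk (a close proxy for TTFT)."""
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
        stream=True,
    )
    async for _chunk in stream:
        # The first chunk may carry only the role delta, but it still marks
        # the moment the server started emitting output.
        return time.perf_counter() - start
    raise RuntimeError("stream produced no chunks")


async def main() -> None:
    model = "openai/gpt-oss-20b"  # swap in the model being served
    ttfts = sorted(
        await asyncio.gather(
            *(measure_ttft(model, "Explain KV caching in one paragraph.")
              for _ in range(100))  # 100 concurrent requests, as in the test above
        )
    )
    print(f"p50 TTFT: {ttfts[len(ttfts) // 2]:.3f}s")
    print(f"p99 TTFT: {ttfts[int(len(ttfts) * 0.99)]:.3f}s")


if __name__ == "__main__":
    asyncio.run(main())
```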
It could be that you are using suboptimal flags for these models; vLLM has a detailed guide here: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html
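Under heavy concurrency, the prefill batching limits in particular can dominate TTFT, so it's worth comparing something like the following against the recipe (the flag values here are illustrative, not the recipe's recommendations):

```
vllm serve openai/gpt-oss-20b \
  --async-scheduling \
  --max-model-len 100000 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256
```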