AIME 25 Accuracy Discrepancy for GPT-OSS-20B (Reasoning Effort=High)

#58
by jiayi37u - opened

Thank you very much for open-sourcing such a powerful large language model.
I’ve noticed that the community has run into difficulties reproducing the results reported in your paper. On AIME 25, the paper states that GPT-OSS-20B (without tools, reasoning effort set to “high”) achieved 91.7% accuracy, whereas our reproduction with vLLM reached only 85.8%. Do you have any suggestions to help us replicate your evaluation results?


Reference link: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use
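
For a sense of scale, the gap between the two figures can be converted into a number of problems. The sketch below is only illustrative arithmetic: it assumes AIME 25 is scored over 30 problems (AIME I and AIME II, 15 each), and the helper name `problems_solved` is hypothetical, not taken from any eval code.

```python
# Assumption: AIME 25 is scored over 30 problems (AIME I + AIME II, 15 each).
NUM_PROBLEMS = 30


def problems_solved(accuracy_pct: float, num_problems: int = NUM_PROBLEMS) -> float:
    """Average number of problems solved implied by an accuracy percentage."""
    return accuracy_pct / 100.0 * num_problems


reported = 91.7    # accuracy stated in the paper (from the post above)
reproduced = 85.8  # accuracy from the vLLM reproduction (from the post above)

# Gap in percentage points and in average problems solved per run.
gap_points = reported - reproduced
gap_problems = problems_solved(reported) - problems_solved(reproduced)

# How many percentage points a single problem is worth.
points_per_problem = 100.0 / NUM_PROBLEMS

print(f"gap: {gap_points:.1f} points, about {gap_problems:.2f} problems per run")
print(f"one problem is worth {points_per_problem:.2f} points")
```

Under that assumption, both figures correspond to fractional problem counts, which suggests each is an average over several sampled runs, and the gap works out to fewer than two problems per run. Differences in the number of samples or in sampling settings could plausibly account for part of a gap of that size, so those are worth comparing first.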

Hey :) We published all of our eval code here: https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals
