openai/gpt-oss-20b · AIME 25 Accuracy Discrepancy for GPT-OSS-20B (Reasoning Effort=High)

Thank you very much for open-sourcing such a powerful large language model.
I’ve noticed that the community has run into difficulties reproducing the results reported in your paper. On AIME 25, the paper states that GPT-OSS-20B (without tools, with reasoning effort set to “high”) achieves an accuracy of 91.7%, whereas our reproduction using vLLM reached only 85.8%. Do you have any suggestions that would help us replicate your evaluation results?
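
To put the gap in concrete terms, here is a rough sketch that converts the two accuracies into an average number of correctly solved problems. It assumes the benchmark is the usual 30-problem set (AIME I and AIME II, 15 problems each) and that both figures are averages over repeated samples; the variable and function names are only illustrative.

```python
# Assumption: AIME 25 = AIME I + AIME II from 2025, 15 problems each.
NUM_PROBLEMS = 30


def expected_correct(accuracy_pct: float, num_problems: int = NUM_PROBLEMS) -> float:
    """Convert an accuracy percentage into an average count of solved problems."""
    return accuracy_pct / 100.0 * num_problems


reported_pct = 91.7    # figure stated in the paper
reproduced_pct = 85.8  # our vLLM reproduction

gap_points = reported_pct - reproduced_pct
gap_problems = expected_correct(reported_pct) - expected_correct(reproduced_pct)

print(f"Gap: {gap_points:.1f} percentage points")
print(f"Gap: about {gap_problems:.2f} problems out of {NUM_PROBLEMS}")
```

Under these assumptions a single problem is worth about 3.3 percentage points, so the discrepancy corresponds to fewer than two problems on average.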

Reference link: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use