openai/gpt-oss-20b · 68.4 on Aider Polyglot with reasoning_effort: high

9 days ago

More details here: https://www.reddit.com/r/LocalLLaMA/comments/1mnxwmw/unsloth_fixes_chat_template_again_gptoss120high/

Pull request to update the Aider leaderboard: https://github.com/Aider-AI/aider/pull/4444

Fernanda24

9 days ago

Subsequent runs supporting the new score, plus more discussion, are in the Aider Discord: https://discord.gg/Y7X7bhMQFV

Fernanda24

9 days ago

*The 68.4 score is for the 120b model.

dibu28

7 days ago

Can you also test 20B?

Fernanda24

2 days ago (edited)

Test run by @tan in the Aider Discord server.

For the whole edit format with the smaller 20b model:

test_cases: 225
model: openai/gpt-oss-20b
edit_format: whole
commit_hash: 32faf82
reasoning_effort: high
pass_rate_1: 18.2
pass_rate_2: 55.6
pass_num_1: 41
pass_num_2: 125
percent_cases_well_formed: 99.6
error_outputs: 1
num_malformed_responses: 1
num_with_malformed_responses: 1
user_asks: 121
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 2124190
completion_tokens: 6713723
test_timeouts: 0
total_tests: 225

And for the diff edit format, also with the smaller 20b model:

model: openai/gpt-oss-20b
edit_format: diff
commit_hash: da45632-dirty
reasoning_effort: high
pass_rate_1: 13.3
pass_rate_2: 45.3
pass_num_1: 30
pass_num_2: 102
percent_cases_well_formed: 77.3
error_outputs: 199
num_malformed_responses: 199
num_with_malformed_responses: 51
user_asks: 115
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
prompt_tokens: 3785993
completion_tokens: 7482667
test_timeouts: 1
total_tests: 225
command: aider --model openai/gpt-oss-20b
date: 2025-08-12
versions: 0.86.1.dev
seconds_per_case: 589.3
total_cost: 0.0000
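As a quick sanity check, the percentages in both result blocks above can be recomputed from the raw counts. This is a minimal sketch that assumes pass_rate_N = 100 * pass_num_N / total_tests and that percent_cases_well_formed is the share of test cases with no malformed response at all. Those formulas are inferred from the posted numbers (they reproduce them exactly), not copied from aider's benchmark code, and the names RunCounts, pct, and derived_metrics are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunCounts:
    # Raw counts copied from the posted aider benchmark results.
    edit_format: str
    total_tests: int
    pass_num_1: int  # cases solved on the first attempt
    pass_num_2: int  # cases solved within two attempts (includes first-try passes)
    num_with_malformed_responses: int  # cases with at least one malformed reply


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, like the posted results."""
    return round(100 * numerator / denominator, 1)


def derived_metrics(run: RunCounts) -> dict[str, float]:
    # Assumed formulas (inferred from the numbers, not from aider's source).
    return {
        "pass_rate_1": pct(run.pass_num_1, run.total_tests),
        "pass_rate_2": pct(run.pass_num_2, run.total_tests),
        "percent_cases_well_formed": pct(
            run.total_tests - run.num_with_malformed_responses, run.total_tests
        ),
    }


whole = RunCounts("whole", 225, 41, 125, 1)
diff = RunCounts("diff", 225, 30, 102, 51)

for run in (whole, diff):
    print(run.edit_format, derived_metrics(run))
```

Under these assumptions the whole run recomputes to 18.2 / 55.6 / 99.6 and the diff run to 13.3 / 45.3 / 77.3, matching the posted values.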

Per-language breakdown (the columns sum to the diff-run totals above):

Language   | Pass rate 1 (%) | Pass rate 2 (%) | Passed 1 | Passed 2 | Cases | User asks | Malformed responses
-----------|-----------------|-----------------|----------|----------|-------|-----------|--------------------
python     | 11.8            | 67.6            | 4        | 23       | 34    | 0         | 4
javascript | 8.2             | 55.1            | 4        | 27       | 49    | 1         | 2
java       | 10.6            | 36.2            | 5        | 17       | 47    | 6         | 8
cpp        | 3.8             | 46.2            | 1        | 12       | 26    | 105       | 11
go         | 12.8            | 20.5            | 5        | 8        | 39    | 3         | 166
rust       | 36.7            | 50.0            | 11       | 15       | 30    | 0         | 8
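To check how the per-language table rolls up, the sketch below sums its count columns and recomputes each language's second-attempt pass rate. The row tuples are transcribed from the table; the helper names totals and per_language_pass_rate_2 are hypothetical, and treating the last column as a count of individual responses (rather than cases) is an assumption based on Go showing 166 malformed responses across only 39 cases.

```python
# Rows of the per-language table: (language, passed_1, passed_2, cases, user_asks, malformed)
Row = tuple[str, int, int, int, int, int]

ROWS: list[Row] = [
    ("python", 4, 23, 34, 0, 4),
    ("javascript", 4, 27, 49, 1, 2),
    ("java", 5, 17, 47, 6, 8),
    ("cpp", 1, 12, 26, 105, 11),
    ("go", 5, 8, 39, 3, 166),
    ("rust", 11, 15, 30, 0, 8),
]


def totals(rows: list[Row]) -> dict[str, int]:
    # Column-wise sums over all languages, skipping the language name.
    keys = ("pass_num_1", "pass_num_2", "cases", "user_asks", "malformed")
    sums = [sum(column) for column in zip(*(row[1:] for row in rows))]
    return dict(zip(keys, sums))


def per_language_pass_rate_2(rows: list[Row]) -> dict[str, float]:
    # Second-attempt pass rate per language, rounded like the table.
    return {lang: round(100 * passed_2 / cases, 1) for lang, _, passed_2, cases, _, _ in rows}


summary = totals(ROWS)
print(summary)
print(per_language_pass_rate_2(ROWS))
print("overall pass_rate_2:", round(100 * summary["pass_num_2"] / summary["cases"], 1))
```

The totals come out to 30 / 102 / 225 / 115 / 199, the same as pass_num_1, pass_num_2, total_tests, user_asks, and num_malformed_responses in the diff run, which is why this table reads as the diff-format breakdown.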

Fernanda24

2 days ago (edited)

> Can you also test 20B?

Done above, by @Tan in the Aider Discord.
