WebApp1K measures one of the oldest and simplest kinds of tasks, one that predates ChatGPT: code completion. You can also view it as a translation task, mapping a test spec into code. It requires no conversation, no reasoning (though reasoning helps sometimes), and no RL.
I don't think it is on the roadmap of top labs. Otherwise you can't explain why Claude 4 scores the same 70+ on SWE-bench, which is far more challenging than this benchmark.
Nor do I encourage model builders to optimize toward my benchmark; topping the leaderboard that way would not be too hard. I just argue that we're still in a very early phase.
What I see now is still the same pattern: labs ship generic models strategically optimized for famous benchmarks, while agent builders (top labs and startups alike) painfully prompt those models to follow their expectations and pray they won't drift overnight.