
Yi Cui

onekq


https://onekq.ai

AI & ML interests

Benchmark, Code Generation Model

Recent Activity

posted an update 4 days ago

WebApp1K measures one of the oldest and simplest kinds of task, one that predates ChatGPT. It is code completion; you can also consider it a translation task that maps a test spec into code. It requires no conversation, reasoning (which helps sometimes), or RL. I don't think it is on the roadmap of top labs. Otherwise, it is hard to explain why Claude 4 has the same 70+ score on SWE-bench, which is far more challenging than this benchmark. Nor do I encourage model builders to optimize for my benchmark; topping its leaderboard would not in itself be too hard. I just argue that we're still in a very early phase. What I witness now is still the same pattern: the release of generic models strategically optimized for famous benchmarks. Meanwhile, agent builders (top labs and startups alike) painfully prompt these models to follow their expectations, and pray they won't drift overnight.

updated a Space 7 days ago

onekq-ai/README

posted an update 7 days ago

As of now, GPT OSS is the top open-source model. Its performance is very close to that of Claude and GPT-5, and above that of all other models. https://huggingface.co/spaces/onekq-ai/WebApp1K-models-leaderboard



onekq's models

None public yet