Yi Cui
I checked Codex and Claude, and they both use their own models to compress.
For OCR, as you stated, lots of training would need to be done. It's an ecosystem problem.
Good stuff! I didn't consider token cost at all.
I'm thinking about an open source project for a context compressor (algorithmic, at most a small on-premise model) for agent builders. Does this make sense? If so, what should it look like?
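To make the idea concrete, here is a rough sketch of the kind of algorithmic compressor I have in mind. Everything here (the names, the token heuristic, the budget policy) is hypothetical and only for illustration:

```python
# Hypothetical sketch of an algorithmic context compressor for agent builders.
# No model involved: just a token budget, recency, and pinned messages.
from dataclasses import dataclass

@dataclass
class Message:
    role: str             # "system", "user", "assistant", "tool"
    content: str
    pinned: bool = False  # task spec / system prompt: never dropped

def rough_token_count(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def compress(history: list[Message], budget: int) -> list[Message]:
    """Keep pinned and recent messages; replace the overflow with short stubs."""
    kept, used = [], 0
    # Walk from the newest message backwards so recency wins.
    for msg in reversed(history):
        cost = rough_token_count(msg.content)
        if msg.pinned or used + cost <= budget:
            kept.append(msg)
            used += cost
        else:
            # Leave a one-line stub so the agent knows something was elided.
            kept.append(Message(msg.role, f"[elided ~{cost} tokens]"))
    return list(reversed(kept))
```

A small on-premise model could later replace the stub step with an actual summary, but the interface (messages in, fewer tokens out) would stay the same.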
Do you know of any work that has studied how agents use context?
Lost in the Middle: How Language Models Use Long Contexts (2307.03172)
I spotted the same problem in coding tasks and documented it in my book (https://www.amazon.com/dp/9999331130).
Why did this problem become hot again? Because many of us thought it had been solved by long-context models, which is not true.
Here we were misled by benchmarks. Most long-context benchmarks are built around the QA scenario, i.e. finding a needle in a haystack. But in agentic scenarios, the model needs to find EVERYTHING in the haystack, and simply can't afford enough attention for this challenge.
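To make the contrast concrete, here is a toy illustration (my own example, not taken from any benchmark suite):

```python
# Needle-in-a-haystack QA: success means surfacing ONE fact buried in the context.
def qa_pass(answer: str, needle: str) -> bool:
    return needle in answer

# Agentic task (e.g. "rename this function everywhere"): success requires acting
# on EVERY relevant item scattered through the context; missing one means failure.
def agent_pass(edited_sites: set[str], all_call_sites: set[str]) -> bool:
    return all_call_sites.issubset(edited_sites)
```

A model can ace the first while routinely failing the second, which is why high needle-in-a-haystack scores tell us little about agentic long-context competence.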
That is the case.
I'm a developer at heart. As a developer, the majority of your time is spent running things and hopping around environments, e.g. the IDE, the cloud, GitHub. These environments all happen to have full-featured bash support, a perfect sandbox for the CLI form factor.
The paradigm change AI brought to the developer world is nothing short of meteoric, but it is also an exception. Many efforts are trying to generalize that momentum to the next area(s). I won't bet on them.
RAG predated TUIs and agents, so to be fair it's quite an achievement to have survived the AI evolution. But I feel it has been overshadowed by context engineering in the agent era. How does everyone feel about this?
https://www.nicolasbustamante.com/p/the-rag-obituary-killed-by-agents
So this model is still, and is meant to be, an OCR model. It does compress long context in a new way, but it will have to be trained for other tasks where long context applies. OCR itself doesn't need long context.
TLDR: lots of work will have to be done to make this mainstream.
https://huggingface.co/blog/onekq/behind-each-token
I am running an experiment with DeepSeek OCR, BTW.
Vision Tokens vs Text Tokens: Understanding the 10× Compression
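Back-of-the-envelope arithmetic behind the headline ratio; every number below is an assumption for illustration, not a measurement of DeepSeek OCR:

```python
chars_per_page = 3000        # a dense page of English text (assumed)
chars_per_text_token = 4     # rough BPE average (assumed)
text_tokens = chars_per_page // chars_per_text_token    # ~750 text tokens

vision_tokens_per_page = 75  # assumed vision-token budget for the rendered page

print(f"~{text_tokens / vision_tokens_per_page:.0f}x compression")  # -> ~10x
```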
I don't think it is on the roadmap of top labs. Otherwise you can't explain why Claude 4 has the same 70+ score on SWE-bench, which is way more challenging than this benchmark.
Nor do I encourage model builders to optimize towards my benchmark; topping the leaderboard that way wouldn't be too hard. I just argue that we're still in a very early phase.
What I see now is still the same pattern: generic models dropped after being strategically optimized towards famous benchmarks. Meanwhile, agent builders (top labs and startups alike) painfully prompt these models to follow their expectations, and pray they won't drift overnight.
https://www.amazon.com/dp/9999331130
https://elivabooks.com/en/book/book-8002082768