Yi Cui
I checked Codex and Claude, and they both use their own models to compress.
For OCR, as you stated, lots of training would need to be done. It's an ecosystem problem.
Good stuff! I didn't consider token cost at all.
I'm thinking about an open source project for a context compressor (algorithmic, at most a small on-premise model) for agent builders. Does this make sense? If so, what should it look like?
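To make the idea concrete, here is a rough sketch of the kind of algorithmic compressor I have in mind. Everything here (the names, the token heuristic, the budget policy) is hypothetical and only for illustration:

```python
# Hypothetical sketch of an algorithmic context compressor for agent builders.
# No model involved: just a token budget, recency, and pinned messages.
from dataclasses import dataclass

@dataclass
class Message:
    role: str             # "system", "user", "assistant", "tool"
    content: str
    pinned: bool = False  # task spec / system prompt: never dropped

def rough_token_count(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def compress(history: list[Message], budget: int) -> list[Message]:
    """Keep pinned and recent messages; replace the overflow with short stubs."""
    kept, used = [], 0
    # Walk from the newest message backwards so recency wins.
    for msg in reversed(history):
        cost = rough_token_count(msg.content)
        if msg.pinned or used + cost <= budget:
            kept.append(msg)
            used += cost
        else:
            # Leave a one-line stub so the agent knows something was elided.
            kept.append(Message(msg.role, f"[elided ~{cost} tokens]"))
    return list(reversed(kept))
```

A small on-premise model could later replace the stub step with an actual summary, but the interface (messages in, fewer tokens out) would stay the same.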
Do you know of any work that has studied how agents use context?
Lost in the Middle: How Language Models Use Long Contexts (2307.03172)
I spotted the same problem in coding tasks and documented it in my book (https://www.amazon.com/dp/9999331130).
Why did this problem become hot again? Because many of us thought it had been solved by long-context models, which is not true.
Here we were misled by benchmarks. Most long-context benchmarks are built around the QA scenario, i.e. finding a needle in a haystack. But in agentic scenarios, the model needs to find EVERYTHING in the haystack, and simply can't afford enough attention for this challenge.
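To make the contrast concrete, here is a toy illustration (my own example, not taken from any benchmark suite):

```python
# Needle-in-a-haystack QA: success means surfacing ONE fact buried in the context.
def qa_pass(answer: str, needle: str) -> bool:
    return needle in answer

# Agentic task (e.g. "rename this function everywhere"): success requires acting
# on EVERY relevant item scattered through the context; missing one means failure.
def agent_pass(edited_sites: set[str], all_call_sites: set[str]) -> bool:
    return all_call_sites.issubset(edited_sites)
```

A model can ace the first while routinely failing the second, which is why high needle-in-a-haystack scores tell us little about agentic long-context competence.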
That is the case.
I'm a developer at heart. As a developer, the majority of your time is spent running things and hopping around environments, e.g. the IDE, the cloud, GitHub. These environments all happen to have full-featured bash support, a perfect sandbox for the CLI form factor.
The paradigm change AI brought to the developer world is nothing short of meteoric, but it is also an exception. Many efforts are trying to generalize that momentum to the next area(s). I won't bet on them.
RAG predated TUIs and agents, so to be fair it's quite an achievement to have survived the AI evolution. But I feel it has been overshadowed by context engineering in the agent era. How does everyone feel about this?
https://www.nicolasbustamante.com/p/the-rag-obituary-killed-by-agents
So this model is still, and is meant to be, an OCR model. It does compress long context in a new way, but it will have to be trained for other tasks where long context applies. OCR itself doesn't need long context.
TLDR: lots of work will have to be done to make this mainstream.
https://huggingface.co/blog/onekq/behind-each-token
I am running an experiment with DeepSeek OCR, BTW.
Vision Tokens vs Text Tokens: Understanding the 10× Compression
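Back-of-the-envelope arithmetic behind the headline ratio; every number below is an assumption for illustration, not a measurement of DeepSeek OCR:

```python
chars_per_page = 3000        # a dense page of English text (assumed)
chars_per_text_token = 4     # rough BPE average (assumed)
text_tokens = chars_per_page // chars_per_text_token    # ~750 text tokens

vision_tokens_per_page = 75  # assumed vision-token budget for the rendered page

print(f"~{text_tokens / vision_tokens_per_page:.0f}x compression")  # -> ~10x
```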
I don't think it is on the roadmap of top labs. Otherwise you can't explain why Claude 4 has the same 70+ score on SWE-bench, which is way more challenging than this benchmark.
Nor do I encourage model builders to optimize towards my benchmark; topping the leaderboard that way wouldn't be too hard. I just argue that we're still in a very early phase.
What I see now is still the same pattern: generic models dropped after being strategically optimized towards famous benchmarks. Meanwhile, agent builders (top labs and startups alike) painfully prompt these models to follow their expectations, and pray they won't drift overnight.
https://www.amazon.com/dp/9999331130
https://elivabooks.com/en/book/book-8002082768