ONEKQ AI

company

AI & ML interests

Benchmark, Code Generation, LLM

Recent Activity

onekq-ai's activity

onekq
posted an update 1 day ago
onekq
posted an update 8 days ago
onekq
posted an update 10 days ago
I'm now testing the new 🐋DeepSeek🐋 R1 and, like all reasoning models, it's awfully slow. 🐢🐢

I don't expect it to break SOTA. In fact, it will be a win if it beats the old R1, which already ranks very high on the leaderboard.

onekq-ai/WebApp1K-models-leaderboard

IMO the world needs a better vanilla LLM, e.g. 🐋DeepSeek🐋 v4 or v3.5, which we will use in daily life. That's the direction Gemini Flash took, which I praised.
onekq
posted an update 16 days ago
🎉🥳 SOTA!!! 🚀👑

🥇 Claude 4 Opus!! 🥇

7 months!! ⌛⌛

I thought the day would never come. But here it is.

onekq-ai/WebApp1K-models-leaderboard

Cost me quite a bit of 💵money💵 but it is all worth it.

Enjoy, and make as much of this as you can!
onekq
posted an update 18 days ago
Highly recommend the latest Gemini Flash. My favorite Google I/O gift. It ranks behind reasoning models but runs a lot faster than them. It beats DeepSeek v3.

onekq-ai/WebApp1K-models-leaderboard

Reasoning is good for coding, but not mandatory.
onekq
posted an update 22 days ago
onekq
posted an update 25 days ago
This paper introduces the notion of "Tests as Prompt". It compiles the results and findings on WebApp1K published in the previous three papers.

https://huggingface.co/papers?q=2505.09027

The central argument here is that test-driven development is a natural fit for LLMs, which scale better than humans do. I bet the future will see thousands of such leaderboards (many more proprietary ones), each dominated by a specialized model.
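The "tests as prompt" idea can be illustrated with a minimal sketch (my own toy example, not WebApp1K's actual harness): the unit test serves as the entire task specification, and a candidate solution is scored by whether the test passes against it. The names `slugify`, `TEST_AS_PROMPT`, and `CANDIDATE` are hypothetical, for illustration only.

```python
# Minimal sketch of a test-driven evaluation loop: the unit test IS the
# task spec handed to the model, and a candidate passes only if the test
# runs clean against it.

TEST_AS_PROMPT = """\
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  a  b  ") == "a-b"
"""

def passes(candidate_source: str, test_source: str) -> bool:
    """Execute candidate + test in a shared namespace; True if the test passes."""
    ns: dict = {}
    try:
        exec(candidate_source, ns)   # model-generated implementation
        exec(test_source, ns)        # the test that served as the prompt
        ns["test_slugify"]()         # run the test function
        return True
    except Exception:
        return False

# A plausible model completion for the prompt above.
CANDIDATE = """\
import re
def slugify(text):
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
"""
```

Scaled up, this loop is what lets the benchmark grade thousands of model outputs with no human in the loop.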
onekq
posted an update 27 days ago
If you also tuned into Altman's second congressional hearing (his first was in 2023), along with other AI executives, my takeaway is two words: New Deal (FDR's, from almost a century ago).

The causal chain is quite fascinating and worthy of a few blog posts or deep-research queries, but I won't have more time for this (I really wish I did), so here goes.

* AI workloads love GPUs because GPUs devote more transistors to compute than CPUs do, and pair them with high-bandwidth memory
* More compute in a small physical space -> more power draw and more heat dissipation
* More heat dissipation -> liquid cooling
* New cooling and heavier power draw -> bigger racks (heavier and taller)
* Bigger racks -> (re)building data centers
* New data centers with higher power demand (peak and stability) -> grid upgrades and nuclear power
onekq
posted an update 30 days ago
The new Mistral Medium model is very impressive for its size. Will it be open-sourced, given Mistral's history? Does anyone have insights?

onekq-ai/WebApp1K-models-leaderboard
onekq
posted an update about 1 month ago
This time Gemini was very quick with API support for its 2.5 Pro May release. The performance is impressive too; it is now among top contenders like o4, R1, and Claude.

onekq-ai/WebApp1K-models-leaderboard
onekq
posted an update about 1 month ago
onekq
posted an update about 1 month ago
I hadn't noticed that Gemini 2.5 (Pro and Flash) had been silently launched for API preview. Their performance is solid, but below QwQ 32B and the latest DeepSeek v3.

onekq-ai/WebApp1K-models-leaderboard
onekq
posted an update about 1 month ago
I tested Qwen3 235B and 32B, and they are both worse than Qwen2.5 32B.
onekq-ai/WebApp1K-models-leaderboard

I used non-thinking mode because the thinking mode is too slow 🐢🐢🐢 to be usable in any way.

Sigh ...
onekq
posted an update about 1 month ago
The Qwen3 235B (MoE) is awfully slow 🐢🐢🐢.

I heard it can switch between reasoning and non-reasoning, but for my question it always goes straight into reasoning mode, with no override switch. I tried Fireworks, DeepInfra, and OpenRouter, and they all behave the same.

What is your experience with Qwen3?
onekq
posted an update about 1 month ago
onekq
posted an update about 2 months ago
I recently attended a panel on AI applications. The panelists were managers and directors at Fortune 500 companies. These people make things happen and own results, so their stories and pain points are fresh.

(1) Models are used EVERYWHERE, customer facing and internal support, etc.
(2) A successful application must improve one of the following: revenue (💵💵), cost (💵💵), or CSAT (still 💵💵)
(3) They proactively search 🤗HF🤗 for models and use them. Open-source models (especially small ones) fit flexibly into their existing workflows/infra, which enables them to deliver, and fast.
(4) The main barrier to adoption is licensing. A director told me they picked a model and fine-tuned it, then learned they would have to share their enhancements. As a result, they dropped that model, and the million-dollar impact went to another one.

So to fellow model builders:
(1) celebrate that our work is useful and generates lots of value
(2) make your license permissive if you want maximum impact
onekq
posted an update about 2 months ago
Heard good things about this model, but no inference providers support it ...

THUDM/GLM-4-9B-0414
onekq
posted an update about 2 months ago
This post discusses the same trend as the Sutton post, but is more concrete and down-to-earth.

https://ysymyth.github.io/The-Second-Half/

Two takeaways for me: (1) deep neural networks are the backbone that unifies everything. RLHF will stand the test of time because it brings two distinct fields (NLP and RL) onto the same model weights. (2) language models will continue to play a central role in the era of agents. They probably won't be the endgame to AGI, but they're definitely not an offramp.