ONEKQ AI

company

AI & ML interests

Benchmark, Code Generation, LLM

Recent Activity

onekq-ai's activity

onekq
posted an update 1 day ago
onekq
posted an update 8 days ago
onekq
posted an update 10 days ago
I'm now testing the new 🐋DeepSeek🐋 R1 and, like all reasoning models, it's awfully slow. 🐢🐢

I don't expect it to break SOTA. In fact, it will be a win if it beats the old R1, which already ranks very high on the leaderboard.

onekq-ai/WebApp1K-models-leaderboard

IMO the world needs a better vanilla LLM, e.g. 🐋DeepSeek🐋 v4 or v3.5, which we will use in daily life. That's the direction Gemini Flash took, which I praised.
onekq
posted an update 16 days ago
🎉🥳 SOTA!!! 🚀👑

🥇 Claude 4 Opus!! 🥇

7 months!! ⌛⌛

I thought the day would never come. But here it is.

onekq-ai/WebApp1K-models-leaderboard

Cost me quite a bit of 💵money💵 but it is all worth it.

Enjoy, and make as much of this as you can!
onekq
posted an update 18 days ago
Highly recommend the latest Gemini Flash. My favorite Google I/O gift. It ranks behind reasoning models but runs a lot faster than them. It beats DeepSeek v3.

onekq-ai/WebApp1K-models-leaderboard

Reasoning is good for coding, but not mandatory.
onekq
posted an update 22 days ago
onekq
posted an update 25 days ago
This paper introduces the notion of "Tests as Prompt". It compiles the results and findings on WebApp1K published in the previous three papers.

https://huggingface.co/papers?q=2505.09027

The central argument here is that test-driven development is a natural fit for LLMs, which scale better than humans do. I bet the future will see thousands of such leaderboards (many more proprietary ones), each dominated by a specialized model.
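The "tests as prompt" idea can be illustrated with a minimal sketch (my own toy example, not WebApp1K's actual harness): the unit test serves as the entire task specification, and a candidate solution is scored by whether the test passes against it. The names `slugify`, `TEST_AS_PROMPT`, and `CANDIDATE` are hypothetical, for illustration only.

```python
# Minimal sketch of a test-driven evaluation loop: the unit test IS the
# task spec handed to the model, and a candidate passes only if the test
# runs clean against it.

TEST_AS_PROMPT = """\
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  a  b  ") == "a-b"
"""

def passes(candidate_source: str, test_source: str) -> bool:
    """Execute candidate + test in a shared namespace; True if the test passes."""
    ns: dict = {}
    try:
        exec(candidate_source, ns)   # model-generated implementation
        exec(test_source, ns)        # the test that served as the prompt
        ns["test_slugify"]()         # run the test function
        return True
    except Exception:
        return False

# A plausible model completion for the prompt above.
CANDIDATE = """\
import re
def slugify(text):
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
"""
```

Scaled up, this loop is what lets the benchmark grade thousands of model outputs with no human in the loop.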
onekq
posted an update 27 days ago
If you also tuned into Altman's second congressional hearing (his first was in 2023), along with other AI executives, my takeaway is two words: New Deal (FDR's, from almost a century ago).

The causal chain is quite fascinating and worthy of a few blog posts or deep-research queries, but I won't have more time for this (I really wish I did), so here goes.

* AI workloads love GPUs because GPUs devote more transistors to compute than CPUs do, and pair them with high-bandwidth memory
* More compute in a small physical space -> more power draw and more heat dissipation
* More heat dissipation -> liquid cooling
* New cooling and heavier power draw -> bigger racks (heavier and taller)
* Bigger racks -> (re)building data centers
* New data centers with higher power demand (peak and stability) -> grid upgrades and nuclear power
onekq
posted an update 30 days ago
The new Mistral Medium model is very impressive for its size. Will it be open-sourced, given Mistral's history? Does anyone have insights?

onekq-ai/WebApp1K-models-leaderboard
onekq
posted an update about 1 month ago
This time Gemini was very quick with API support for its 2.5 Pro May release. The performance is impressive too; it is now among top contenders like o4, R1, and Claude.

onekq-ai/WebApp1K-models-leaderboard
onekq
posted an update about 1 month ago
onekq
posted an update about 1 month ago
I hadn't noticed that Gemini 2.5 (Pro and Flash) had been silently launched for API preview. Their performance is solid, but below QwQ 32B and the latest DeepSeek v3.

onekq-ai/WebApp1K-models-leaderboard
onekq
posted an update about 1 month ago
I tested Qwen3 235B and 32B, and they are both worse than Qwen2.5 32B.
onekq-ai/WebApp1K-models-leaderboard

I used non-thinking mode because the thinking mode is too slow 🐢🐢🐢 to be usable in any way.

Sigh ...
onekq
posted an update about 1 month ago
The Qwen3 235B (MoE) is awfully slow 🐢🐢🐢.

I heard it can switch between reasoning and non-reasoning, but for my question it always goes straight into reasoning mode, with no override switch. I tried Fireworks, DeepInfra, and OpenRouter, and they all behave the same.

What is your experience with Qwen3?
onekq
posted an update about 1 month ago
onekq
posted an update about 2 months ago
I recently attended a panel on AI applications. The panelists were managers and directors at Fortune 500 companies. These people make things happen and own results, so their stories and pain points are fresh.

(1) Models are used EVERYWHERE, customer facing and internal support, etc.
(2) A successful application must improve one of the following: revenue (💵💵), cost (💵💵), or CSAT (still 💵💵)
(3) They proactively search 🤗HF🤗 for models and use them. Open-source models (especially small ones) fit flexibly into their existing workflows/infra, which enables them to deliver, and fast.
(4) The main barrier to adoption is licensing. A director told me they picked a model and fine-tuned it, then learned they would have to share their enhancements. As a result, they dropped that model, and the million-dollar impact went to another one.

So to fellow model builders:
(1) celebrate that our work is useful and generates lots of value
(2) make your license permissive if you want maximum impact
onekq
posted an update about 2 months ago
Heard good things about this model, but no inference providers support it ...

THUDM/GLM-4-9B-0414
onekq
posted an update about 2 months ago
This post discusses the same trend as the Sutton post, but is more concrete and down-to-earth.

https://ysymyth.github.io/The-Second-Half/

Two takeaways for me: (1) deep neural networks are the backbone that unifies everything. RLHF will stand the test of time because it brings two distinct fields (NLP and RL) onto the same model weights. (2) language models will continue to play a central role in the era of agents. They probably won't be the endgame to AGI, but they're definitely not an offramp.