shekkizh
posted an update 3 days ago
Think AGI is just around the corner? Not so fast.

When OpenAI released its Computer-Using Agent (CUA) API, I happened to be playing Wordle 🧩 and thought, why not see how the model handles it?
Spoiler: Wordle turned out to be a surprisingly effective benchmark.
So Romain Cosentino Ph.D. and I dug in and analyzed the results of several hundred runs.

🔑 Takeaways
1️⃣ Even the best computer-using models struggle with simple, context-dependent tasks.
2️⃣ Visual perception and reasoning remain major hurdles for multimodal agents.
3️⃣ Real-world use cases reveal significant gaps between hype and reality. Perception accuracy drops to near zero by the last turn 📉

🔗 Read our arXiv article for more details: https://www.arxiv.org/abs/2504.15434

I wonder if it's just bad colour perception, bad reasoning, unexpectedly bad prompting, or some combination of those.

Like if you can somehow give accurate colours for each letter in each row, can the agent do better? (I don't know whether that's possible with OpenAI CUA)
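
One way to set up that experiment, as a rough sketch: compute the colours in code and pass them to the agent as text, so it doesn't have to read them off the screenshot. The function name `wordle_feedback` and the colour labels are made up for illustration; none of this comes from the OpenAI CUA API or from the paper's setup, and it assumes you know the answer word (e.g. in a self-hosted Wordle clone).

```python
from collections import Counter

def wordle_feedback(guess: str, answer: str) -> list[str]:
    """Return one colour label per letter of `guess`, using standard Wordle rules."""
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    colours = ["gray"] * len(guess)

    # Answer letters that are NOT matched exactly; only these can produce yellows.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact position matches are green.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            colours[i] = "green"

    # Second pass: a misplaced letter is yellow only while unmatched copies remain,
    # which handles repeated letters correctly.
    for i, g in enumerate(guess):
        if colours[i] != "green" and remaining[g] > 0:
            colours[i] = "yellow"
            remaining[g] -= 1

    return colours

# Hypothetical text description that could be added to the agent's prompt.
row = wordle_feedback("crane", "react")
print(", ".join(f"{letter.upper()}={colour}" for letter, colour in zip("crane", row)))
```

If the agent still fails with this text in the prompt, that would point at reasoning rather than colour perception.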

Also, if the problem is with the image tokenization, then it sounds like a CNN would be able to perceive the whole grid better, if there were such a model capable of playing Wordle.

·

Images are split into patches and each patch is tokenized: tokenization projects the patch into a feature dimension and quantizes it. That pipeline probably already includes a CNN and/or attention. The issue is that the model is not able to reason about both color and text in the tokenized space.
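
To make the patch-then-quantize description above concrete, here is a minimal NumPy sketch. Everything specific in it is an assumption for illustration: the patch size, the random linear projection, the random codebook, and the helper names `patchify` and `quantize`. It is not how OpenAI CUA or any particular model tokenizes images; real systems learn the projection and codebook, and some use continuous patch embeddings instead of discrete codes.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patches, one flattened patch per row."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook entry."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))           # stand-in for a screenshot crop
patches = patchify(image, patch=8)        # 16 patches, 192 values each
projection = rng.normal(size=(192, 16))   # assumed (random) feature projection
features = patches @ projection           # one 16-dim feature vector per patch
codebook = rng.normal(size=(64, 16))      # assumed (random) 64-entry codebook
tokens = quantize(features, codebook)     # one discrete token id per patch

print(patches.shape, features.shape, tokens.shape)
```

Note that a single token has to carry both the letter shape and the tile colour of whatever falls inside its patch, which is where the color-plus-text reasoning problem shows up.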

We ran about 1000 experiments: different prompting, a tool call to a different model for recognition, and several other techniques. The results still hold. The paper covers only a small part of the analysis. 🤷‍♂️