OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Abstract
In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks into increasing levels of complexity, from basic precision clicking to multi-step, multi-application tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the test cases to ensure that SOTA (state-of-the-art) agents (at the time of publication) do not achieve results higher than 50%, while the average white-collar worker can perform all of these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate of less than 2%. Therefore, this benchmark presents solid ground for fully automated measurement of the progress, capabilities, and effectiveness of GUI-navigation AI agents over the short and medium-term horizon. The source code of the benchmark is available at https://github.com/agentsea/osuniverse.
Community
🚀 We have released a benchmark we had to build: OSUniverse 🌌
GitHub: https://github.com/agentsea/osuniverse
Arxiv paper: https://arxiv.org/abs/2505.03570
Landing page: https://agentsea.github.io/osuniverse/
🤔 Why did we have to build it?
Because we needed a clear, simple way to test pure multimodal GUI-navigation agents on desktop & web tasks that are easy for the average white-collar worker but still hard for machines 🖥️🤖.
We do thousands of these tasks every day without a second thought, yet agents still fall far short of the average office worker 🧑💼📉.
Can a bot…
🖱️ drag & drop?
🎨 draw something in GIMP/Photoshop?
🤖 beat a CAPTCHA?
📑 juggle multiple apps?
📊 distill info, summarize it, & drop it into a formula-filled spreadsheet?
Spoiler alert: (mostly not!)
We take all this for granted because it's natural and intuitive to us, but agents still struggle with many of these tasks, even the most cutting-edge models like Qwen 2.5 VL, Operator, and Claude Computer Use.
We also wanted a benchmark that's easy to run and expand with new tests. That means fully Dockerized, flexible with respect to agent architectures and models, with non-deterministic scoring, and with precise prompts that test agents at ever-escalating complexity on desktop and web tasks.
Here's a more detailed rundown of what we built. The benchmark is:
🧩 Independent of the style of agent architecture (i.e., not just ReAct but anything anyone could dream up, including multi-agent architectures).
⚛️ Non-deterministic, so that we can test more complex scenarios that aren't easily scored with simple heuristics, like whether an agent actually drew a smiley face or a flower.
🤖 Validated by AI, with a difference of under 2% versus a human scorer.
⚡ Fast and lightweight: all the tests run in a Docker container, so they spin up quickly and easily on any platform, with no VMs to muck with at all.
🥞 Easily extensible: every test is configured via YAML (see the sketch after this list).
🎲 Escalating complexity at each level, advancing from paper-level tasks, where the agent just has to see the screen and accurately describe it, to gold-level tasks: complex multi-app scenarios that test everything from drawing, to dragging and dropping, to distilling information from one app and entering it into another.
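To make the YAML point above concrete, here is a minimal sketch of what a test case could look like, loaded with PyYAML. The field names and the level/category values are illustrative assumptions, not the actual OSUniverse schema; see the repository for the real format.

```python
# Minimal sketch of a YAML-defined test case, parsed with PyYAML.
# NOTE: the field names and values below are illustrative assumptions,
# not the actual OSUniverse schema -- check the repository for the real format.
import yaml

EXAMPLE_TEST_CASE = """
name: gimp-draw-smiley
level: silver            # one of the five complexity levels (naming assumed)
category: drawing        # one of the nine categories (naming assumed)
prompt: >
  Open GIMP and draw a simple smiley face on a blank canvas.
checks:                  # non-deterministic checks scored by the AI validator
  - type: screenshot     # judge the final desktop screenshot
    expect: "A smiley face is visible on the GIMP canvas."
  - type: trajectory     # inspect the agent's action trajectory
    expect: "The agent used the pencil or brush tool."
"""

test_case = yaml.safe_load(EXAMPLE_TEST_CASE)
print(test_case["name"], "->", [c["type"] for c in test_case["checks"]])
```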
A few benchmarks have tried to address this already, most notably OSWorld. We took a lot of inspiration from OSWorld and deeply respect that team. It's a good benchmark, but it has some limitations for us, in particular:
🐢 It runs in a VMware/VirtualBox VM (or via Docker, which simply runs a KVM VM inside the container). The reason is that Windows GUIs don't run in Docker, but the truth is you don't need Windows to test: if a model can use LibreOffice Calc, it can be trained to use Excel, so simpler and easier is better here. We also much prefer the speed and reproducibility of Docker.
🔗 The benchmark is chained to ReAct-style agents, and it's hard to implement any other kind of agent architecture. It's possible to write an alternative, but it's not easy or documented, and a future SOTA agent may have a breakthrough that uses a completely different style of interacting with the world.
🖍️ The tests require deterministic validation. Many strong agentic use cases simply cannot be tested deterministically: a deterministic test can't tell whether an agent successfully drew a 🙂 in GIMP.
🤷‍♂️ Finally, the biggest problem is that the prompts for the benchmark are vague and require a tremendous amount of inductive and deductive reasoning for a machine to even interpret what to do, much less do the task. This significantly hinders model performance, resulting in much lower scores: often not because a model cannot do the task, but because it can't understand the prompt.
In contrast, we designed our benchmark for:
✅ Multimodality: designed to rely on vision only, without any extra knowledge about the environment.
✅ Diversity: contains 160 tasks across 5 levels of complexity and 9 categories; all tasks are carefully crafted to be challenging and representative of real-world scenarios that are easy for the average office worker but hard for machines.
✅ Automated validation: includes a Gemini-powered validator with an average error rate of less than 2%; each test case supports four types of validation: the textual output of the agent, the final screenshot of the desktop, the agent trajectory, and the output of an arbitrary bash command run after the agent has finished the task.
✅ Flexibility: you can add new agents, runtimes, and validators (a sketch of how these pieces could fit together follows this list).
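As a rough sketch of how the four validation signals and the pluggable pieces could fit together, here is a hypothetical set of interfaces. The class and method names are assumptions for illustration only, not the benchmark's actual API; the real implementation lives in the repository linked above.

```python
# Hypothetical sketch of pluggable agent / validator interfaces.
# Names and signatures are assumptions for illustration only; the real
# OSUniverse API lives in the repository linked above.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Everything a validator may look at after the agent finishes."""
    final_answer: str                     # textual output of the agent
    final_screenshot: bytes               # PNG of the desktop at the end
    trajectory: list[str] = field(default_factory=list)  # actions taken
    bash_output: str = ""                 # output of a post-run bash command


class Agent(ABC):
    """Any architecture (ReAct, multi-agent, ...) that can see and act."""

    @abstractmethod
    def run(self, prompt: str) -> RunResult:
        ...


class Validator(ABC):
    @abstractmethod
    def validate(self, prompt: str, expect: str, result: RunResult) -> bool:
        ...


class ScreenshotValidator(Validator):
    """Non-deterministic check: ask a vision LLM (e.g. Gemini) whether
    the final screenshot satisfies the expectation."""

    def validate(self, prompt: str, expect: str, result: RunResult) -> bool:
        verdict = ask_vision_llm(          # placeholder for a real LLM call
            image=result.final_screenshot,
            question=f"Task: {prompt}\nDoes the screenshot show: {expect}? "
                     "Answer PASS or FAIL.",
        )
        return verdict.strip().upper().startswith("PASS")


def ask_vision_llm(image: bytes, question: str) -> str:
    """Placeholder: wire this to your multimodal model of choice."""
    raise NotImplementedError
```

Analogous validators for the agent's textual answer, its trajectory, and the bash-command output would plug into the same `Validator` interface.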
Give it a spin & tell us what you think! 🏃‍♂️💬 We're open to any and all feedback and would love to have outside teams put models through their paces and submit the results.
Submit results with full run logs 👉 research [at] kentauros.ai
Help us grow the benchmark and our open-source desktop tools, AgentDesk & AgentD 🙌
Next up: Version 2, with more silver & gold tasks 🥈🥇; expect SOTA scores to plunge below 20% 📉
Thanks for reading! 💙