CU-1: RF-DETR-M for Computer Use Agent

Model Description

CU-1 (Computer Use Agent v1) is a fine-tuned RF-DETR-M (RF-DETR Medium) model optimized for autonomous computer interaction. It serves as the visual perception backbone of our computer use agent, enabling real-time UI element detection and multi-action task automation across diverse graphical interfaces.

Key Features:

  • 70.8% accuracy on the WebClick benchmark (vs. 58.8% for OmniParser)
  • Multi-action task support beyond single-click benchmarks
  • Optimized training pipeline with merged COCO-format datasets
  • Class-agnostic detection supporting diverse UI elements

Methodology Revision Notice

Important: This paper presents revised benchmark results following a methodological correction. Our initial evaluation used default YOLO detection parameters and baseline prompts, which do not reflect optimal performance conditions for either model. We subsequently re-evaluated both CU-1 and OmniParser V2 using their respective optimized detection thresholds (0.35 for CU-1, 0.05 for OmniParser V2 from official sources) and refined prompts for improved task instruction clarity. Both sets of results are presented for transparency, with the optimized evaluation better representing real-world deployment scenarios where parameters and prompts are tuned for specific use cases.

Training Architecture

Dataset Architecture

The model is trained on a merged COCO-format dataset combining multiple UI detection sources, ensuring broad coverage across platforms, applications, and visual styles. The class-agnostic approach enables detection of any clickable element without predefined categories.
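
Mechanically, the class-agnostic merge amounts to concatenating the COCO annotation files and collapsing every category into a single one. Below is a minimal sketch, assuming each source is a standard Roboflow-style _annotations.coco.json export; the merge_coco_class_agnostic helper and the example paths are illustrative, not the actual training tooling.

import json

def merge_coco_class_agnostic(paths, out_path):
    """Merge several COCO-format annotation files into one class-agnostic dataset.

    Every annotation is remapped to a single 'ui_element' category, and image /
    annotation IDs are re-assigned to avoid collisions between sources.
    """
    merged = {
        "images": [],
        "annotations": [],
        "categories": [{"id": 1, "name": "ui_element"}],
    }
    next_img_id, next_ann_id = 1, 1
    for path in paths:
        with open(path) as f:
            coco = json.load(f)
        id_map = {}
        for img in coco["images"]:
            id_map[img["id"]] = next_img_id
            merged["images"].append(dict(img, id=next_img_id))
            next_img_id += 1
        for ann in coco["annotations"]:
            merged["annotations"].append(dict(
                ann,
                id=next_ann_id,
                image_id=id_map[ann["image_id"]],
                category_id=1,  # class-agnostic: single category
            ))
            next_ann_id += 1
    with open(out_path, "w") as f:
        json.dump(merged, f)

# Example (hypothetical file layout):
# merge_coco_class_agnostic(
#     ["months/train/_annotations.coco.json", "web/train/_annotations.coco.json"],
#     "merged/train/_annotations.coco.json",
# )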

Training Dataset Composition:

Dataset | Train | Valid | Test | Total
--- | --- | --- | --- | ---
months.v1i.coco | 173 | 25 | 12 | 210
all-item-merged.v1-100-60.coco | 334 | 35 | 14 | 383
Web.v3i.coco | 493 | 264 | 90 | 847
Website elements.v3i.coco | 133 | 10 | 3 | 146
Website Elements.v16i.coco | 679 | 55 | 21 | 755
Website.v1i.coco | 844 | 242 | 123 | 1,209
TOTAL | 2,656 | 631 | 263 | 3,550

Training Configuration (see the fine-tuning sketch after this list):

  • Training images: 2,656 annotated UI screenshots
  • Epochs: 30
  • Batch size: 8
  • Learning rate: 5e-4
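
For illustration, these hyperparameters would translate into an RF-DETR fine-tuning call roughly like the one below. The argument names follow the rfdetr library's documented train() interface and the dataset path is a placeholder; verify both against your installed version before running.

from rfdetr.detr import RFDETRMedium

# Fine-tuning sketch using the hyperparameters listed above.
# "merged_dataset" is a placeholder for the merged COCO-format dataset directory.
model = RFDETRMedium()
model.train(
    dataset_dir="merged_dataset",  # expects COCO-format train/valid/test splits
    epochs=30,
    batch_size=8,
    lr=5e-4,
)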

WebClick Benchmark Evaluation

Benchmark Methodology: The WebClick benchmark evaluates whether models can correctly identify clickable elements at specified target coordinates. Each sample returns a binary result (success/failure), with the final accuracy calculated as the average across all samples.

Evaluation performed on 1,639 samples across three categories using Gemini 2.5 Pro as the decision-making LLM.
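
In other words, each sample reduces to a point-in-region test followed by an average. A minimal sketch of that scoring logic, assuming the ground truth is a target bounding box and the agent returns a click point (helper names are illustrative, not the official WebClick harness):

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def sample_success(click_xy: Tuple[float, float], target_box: Box) -> bool:
    """Binary per-sample result: did the predicted click land inside the target region?"""
    x, y = click_xy
    x1, y1, x2, y2 = target_box
    return x1 <= x <= x2 and y1 <= y <= y2

def benchmark_accuracy(results: List[bool]) -> float:
    """Final accuracy is the mean of the binary per-sample results."""
    return sum(results) / len(results)

# benchmark_accuracy([True, True, False]) -> 0.666...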

Technical Parameters

Detection Configuration (see the configuration sketch after this list):

  • CU-1:
    • Confidence threshold: 0.35
    • Model: RF-DETR-Medium
  • OmniParser:
    • Confidence threshold: 0.05
    • IOU threshold: 0.1
    • Model: YOLOv8-based icon detection
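
For reference, the same settings expressed as plain configuration (values are those listed above; the dictionary structure itself is only illustrative):

DETECTION_CONFIG = {
    "cu1": {
        "model": "RF-DETR-Medium",
        "confidence_threshold": 0.35,
    },
    "omniparser_v2": {
        "model": "YOLOv8-based icon detection",
        "confidence_threshold": 0.05,
        "iou_threshold": 0.1,
    },
}

# For CU-1, the threshold is passed directly to predict(), e.g.:
#   detections = model.predict(image_rgb, threshold=DETECTION_CONFIG["cu1"]["confidence_threshold"])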

Annotation System: Both models use numbered bounding box annotations where each detected UI element is assigned a unique ID displayed above its bounding box. The LLM (Gemini 2.5 Pro) analyzes the annotated screenshot and selects elements by their ID numbers to perform click actions. Each bounding box is drawn with a thin border for visibility, with the ID number displayed in a black label box with white text positioned above each element.
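
A minimal OpenCV sketch of this annotation scheme (colors, font scale, and offsets are illustrative; the actual rendering code may differ):

import cv2
import numpy as np

def annotate(image: np.ndarray, boxes) -> np.ndarray:
    """Draw numbered bounding boxes: thin border, ID in a black label with white text."""
    out = image.copy()
    for idx, (x1, y1, x2, y2) in enumerate(boxes):
        x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
        cv2.rectangle(out, (x1, y1), (x2, y2), (255, 0, 0), 1)  # thin border
        label = str(idx)
        (tw, th), _ = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
        # Black label box above the element, white ID text inside it.
        cv2.rectangle(out, (x1, y1 - th - 6), (x1 + tw + 6, y1), (0, 0, 0), -1)
        cv2.putText(out, label, (x1 + 3, y1 - 4),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 1)
    return out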

LLM Decision Process: The benchmark evaluates the agent in a constrained scenario where it must select a single element to click. The LLM receives:

  1. A task instruction (e.g., "Click on March 19th in the calendar")
  2. An annotated screenshot showing all detected elements with their IDs

The LLM is prompted to analyze the image and respond with a tool call in the format:

{"name": "click", "parameters": {"id": <box_id>}}

Note that the full CU-1 agent supports multiple actions (click, type, scroll, press, right_click, double_click, etc.), but for benchmark consistency, only the click action is evaluated. This tests the fundamental capability of correctly identifying and selecting UI elements.
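
Once the LLM replies, the agent only needs to parse the tool call and resolve the chosen ID back to screen coordinates. A hedged sketch, assuming the JSON has already been extracted from the model response and boxes holds the detections in (x1, y1, x2, y2) order:

import json

def parse_click_call(raw: str, boxes):
    """Parse the LLM's tool call and resolve the chosen box ID to click coordinates."""
    call = json.loads(raw)
    if call.get("name") != "click":
        raise ValueError(f"Expected a click call, got: {call.get('name')}")
    box_id = int(call["parameters"]["id"])
    x1, y1, x2, y2 = boxes[box_id]
    # Use the center of the selected bounding box as the click point.
    return (x1 + x2) / 2, (y1 + y2) / 2

# parse_click_call('{"name": "click", "parameters": {"id": 12}}', boxes) -> (x, y)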

BBC News Annotation Example

Figure 1: BBC News website showing numbered annotations for all interactive elements including navigation items, article links, and media controls.

Airbnb Annotation Example

Figure 2: Airbnb search interface with numbered annotations on calendar dates, property listings, filters, and interactive controls.

Results

Metric | CU-1 (RF-DETR-M) | OmniParser | Relative Improvement
--- | --- | --- | ---
Overall Accuracy | 70.8% | 58.8% | +20%
Agent Browse | 66% | 58% | +14%
Calendars | 64% | 46% | +39%
Human Browse | 83% | 73% | +14%

Table 1: Performance comparison between CU-1 and OmniParser across WebClick benchmark categories (optimized parameters)

Methodology Note:

Initial evaluation used default YOLO detection parameters, yielding OmniParser accuracy of 40.7%. Following parameter optimization (confidence threshold 0.05, IOU threshold 0.1 from official deployment configurations) and refined prompts, OmniParser improved to 58.8%. CU-1 improved from 67.5% to 70.8% solely through enhanced system prompts, maintaining its threshold of 0.35 throughout both evaluations.

Initial vs Optimized Performance

Comparison showing impact of parameter optimization on OmniParser performance (40.7% → 58.8%)

Category Performance Evolution

Category-level results demonstrating performance gains from optimized detection parameters and improved prompts

Category Breakdown:

  • Agent Browse: Automated navigation tasks requiring identification of typical web elements like buttons, links, and form fields
  • Calendars: Date selection interfaces with dense grid layouts of small, similar-looking elements
  • Human Browse: Real-world web browsing scenarios with diverse UI patterns and complex page structures

CU-1 shows particularly strong performance on Calendar tasks (+39% relative improvement), demonstrating a superior ability to distinguish between densely packed, visually similar UI elements, a critical capability for autonomous agents.

Detection Statistics:

  • Average elements detected per image: 82.3 (CU-1) vs. 50.6 (OmniParser)
  • Average processing time per image: 0.82 s (CU-1) vs. 0.77 s (OmniParser)

Visual Performance Comparison

Examples showing CU-1 (blue boxes) vs OmniParser (orange boxes) detection capabilities:

Detection Example 1

Figure 5: Calendar date selection interface with dual-month view (April-May 2026). CU-1 detects 103 interactive elements including individual calendar dates for both months, navigation arrows, date input fields, and action buttons (Reset, Cancel, Apply), while OmniParser only identifies 47 elements, missing numerous calendar dates and form controls.

Detection Example 2

Figure 6: Spotify music streaming platform showing search results for artist "Gojira". CU-1 identifies 98 elements including navigation tabs (Tracks, Albums, Playlists, Artists, Episodes, Profiles), individual track rows with action buttons (play, like, more options), artist information, and media controls, compared to OmniParser's 60 detections that miss several interactive elements and granular controls.

WebClick benchmark click decision examples with Gemini 2.5 Pro (green box: ground truth, blue: CU-1 selection, orange: OmniParser selection):

Click Decision 1

Figure 7: Travel booking website with flight search and calendar date picker (April-May 2025). Query: a click task on the calendar interface. CU-1 correctly identifies and clicks the target date element (May 27th) within the dense calendar grid, while OmniParser fails to locate the correct date element.

Click Decision 2

Figure 8: Booking.com accommodation search with stay duration selector. Query: select a stay duration option. CU-1 demonstrates superior fine-grained detection by accurately identifying both the "A month" text label and its associated radio button as separate interactive elements, enabling precise selection. OmniParser fails to detect these subtle UI components, missing the granular structure of the duration selector interface.

Benchmark Context:

The WebClick benchmark evaluates single-click accuracy on web UI tasks. While our agent achieves 70.8% accuracy (compared to 58.8% for OmniParser), it's important to note that CU-1 is designed for multi-action sequences beyond the single-click paradigm:

  • Sequential Actions: Screenshot before/after each action for context awareness
  • Complex Workflows: Navigate through multi-step processes autonomously
  • Error Recovery: Adaptive behavior based on UI state changes

Video 1: Example of CU-1 agent performing a multi-step task requiring several sequential actions to achieve the final result.

Agent Architecture & Capabilities

Visual Processing Pipeline

CU-1 powers a sophisticated computer use agent with multiple detection modes:

# From agent_cv.py - Core processing loop
async def run_agent(user_query: str):
    # 1. Capture screenshot
    screenshot = capture_screenshot()

    # 2. Process with RF-DETR (CU-1)
    boxes, annotated, atlas = process_image(screenshot)

    # 3. Multiple methods to communicate detected elements to the LLM
    #    (visual annotations, coordinates, atlas grids, etc.)

    # 4. LLM decision making with visual context
    # 5. Execute action and capture result
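
The excerpt elides screenshot capture and the surrounding action loop. Below is a self-contained sketch of how those pieces could fit together; run_agent_sketch and its callback parameters are placeholders standing in for the real agent_cv.py components, and pyautogui is just one possible way to grab the screen.

import numpy as np
import pyautogui  # pyautogui.screenshot() returns a PIL Image

def capture_screenshot() -> np.ndarray:
    """Grab the current screen as an RGB numpy array."""
    return np.array(pyautogui.screenshot())

async def run_agent_sketch(user_query: str, process_image, decide, execute, max_steps: int = 10):
    """Placeholder loop: perceive -> decide -> act, re-screenshotting after each action."""
    for _ in range(max_steps):
        screenshot = capture_screenshot()                     # 1. capture
        boxes, annotated, atlas = process_image(screenshot)   # 2. CU-1 detection + annotation
        action = await decide(user_query, annotated, boxes)   # 3-4. LLM decision with visual context
        if action is None:                                    # decide() signals task completion
            break
        execute(action, boxes)                                # 5. perform the action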

Agent System Architecture

CU-1 serves as the visual perception foundation for an autonomous computer use agent capable of complex multi-step interactions across any graphical interface.

Key Differentiators

Beyond Single-Click Benchmarks:

While WebClick evaluates single-click accuracy, CU-1 excels at complex multi-action sequences:

# Example: Multi-step form submission
user_query = "Fill out the registration form with my information"

# Agent performs:
# 1. Screenshot → Detect form fields
# 2. Click name field → Type name
# 3. Screenshot → Verify input
# 4. Click email field → Type email
# 5. Screenshot → Verify input
# 6. Select country dropdown → Choose option
# 7. Check agreement boxes
# 8. Click submit button
# 9. Screenshot → Confirm success
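
A sketch of how each chosen step could be dispatched once the LLM has selected an action; the action names mirror the list mentioned earlier (click, type, scroll, press, right_click, double_click), and pyautogui is used here only as an illustrative executor, not necessarily the agent's actual backend.

import pyautogui

def execute_action(action: dict, boxes) -> None:
    """Map an LLM tool call to a concrete UI interaction (illustrative dispatcher)."""
    name = action["name"]
    params = action.get("parameters", {})
    if name in ("click", "right_click", "double_click"):
        x1, y1, x2, y2 = boxes[int(params["id"])]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if name == "click":
            pyautogui.click(cx, cy)
        elif name == "right_click":
            pyautogui.rightClick(cx, cy)
        else:
            pyautogui.doubleClick(cx, cy)
    elif name == "type":
        pyautogui.write(params["text"], interval=0.02)
    elif name == "press":
        pyautogui.press(params["key"])
    elif name == "scroll":
        pyautogui.scroll(int(params["amount"]))
    else:
        raise ValueError(f"Unsupported action: {name}")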

Contextual Awareness:

The agent maintains state across actions through before/after screenshots (see the sketch after this list), enabling:

  • Error detection and recovery
  • Verification of action success
  • Handling of semi-dynamic content
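
One simple way to implement such verification is a coarse before/after screenshot diff: if almost nothing changed on screen, the action likely failed or the page has not finished updating. A hedged sketch (the 0.5% threshold is arbitrary, and the real agent may rely on the LLM's own comparison of the two screenshots instead):

import numpy as np

def screen_changed(before: np.ndarray, after: np.ndarray, threshold: float = 0.005) -> bool:
    """Return True if the fraction of changed pixels exceeds the threshold."""
    if before.shape != after.shape:
        return True  # resolution change, navigation to a new layout, etc.
    changed = np.any(before != after, axis=-1)  # per-pixel "changed" mask
    return changed.mean() > threshold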

LLM Integration:

Seamless integration with vision-language models (Gemini, GPT-4V, Claude) for intelligent decision-making based on visual context and user intent.

Model Usage

Quick Start

The model.pth file contains the model weights. To use them, you need to install the required dependencies first:

# Core requirements
pip install torch torchvision opencv-python pillow

# RF-DETR library
pip install rfdetr

from rfdetr.detr import RFDETRMedium
import cv2
import numpy as np

# Load the model with your trained weights
model = RFDETRMedium(pretrain_weights="model.pth", resolution=1600)

# Process an image
image = cv2.imread("screenshot.png")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Run detection
detections = model.predict(image_rgb, threshold=0.35)  # 0.35 matches the benchmark configuration above

# Get results
boxes = detections.xyxy  # Bounding boxes
scores = detections.confidence  # Confidence scores

print(f"Detected {len(boxes)} UI elements")

Limitations & Future Work

Current Limitations:

  • 70.8% single-click accuracy leaves room for improvement
  • Performance degrades on very small UI elements (<20px)
  • Limited to static screenshots (no video/animation support yet)

Authors

Léo Appourchaux - Lead Developer at TW3 Partners
Noé Brandolini - R&D at TW3 Partners - Student at École Centrale d'Électronique
David Soeiro-Vuong - R&D at Racine.ai - Student at École Centrale d'Électronique
Matis Despujols - R&D at TW3 Partners
Paul Lemaistre - GD at Racine.ai – Adjunct Professor at École Centrale d'Électronique

About École Centrale d'Électronique (ECE):

ECE is a multi-program, multi-campus, and multi-sector engineering school specializing in digital engineering. It trains engineers and technology experts for the 21st century, capable of meeting the challenges of the dual digital and sustainable development revolutions.

Citations

Model Citation

@misc{cu1-computer-use-agent-2025,
  author = {CU-1 Team},
  title = {CU-1: RF-DETR-M for Computer Use Agent},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/CU-1/rf-detr-computer-use}}
}

Benchmark Dataset

@misc{webclick2024,
  title = {WebClick Dataset},
  type = {Benchmark Dataset},
  author = {Hcompany},
  howpublished = {\url{https://huggingface.co/datasets/Hcompany/WebClick}},
  journal = {Hugging Face Datasets},
  publisher = {Hugging Face},
  year = {2024}
}

Training Datasets

@misc{months_dataset,
  title = {months Dataset},
  type = {Open Source Dataset},
  author = {YOLO},
  howpublished = {\url{https://universe.roboflow.com/yolo-ujkjn/months}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {jul}
}

@misc{all-item-merged_dataset,
  title = {all-item-merged Dataset},
  type = {Open Source Dataset},
  author = {pc},
  howpublished = {\url{https://universe.roboflow.com/pc-fjqbc/all-item-merged}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2022},
  month = {sep}
}

@misc{web-l67bi_dataset,
  title = {Web Dataset},
  type = {Open Source Dataset},
  author = {Vitaliy Roshko},
  howpublished = {\url{https://universe.roboflow.com/vitaliy-roshko-fu9tw/web-l67bi}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {aug}
}

@misc{website-elements-aneyv_dataset,
  title = {Website elements Dataset},
  type = {Open Source Dataset},
  author = {Dibyajyoti Mohanty},
  howpublished = {\url{https://universe.roboflow.com/dibyajyoti-mohanty-eqerk/website-elements-aneyv}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2024},
  month = {jun}
}

@misc{website-elements-064fn_dataset,
  title = {Website Elements Dataset},
  type = {Open Source Dataset},
  author = {workspace},
  howpublished = {\url{https://universe.roboflow.com/workspace-8hc0w/website-elements-064fn}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {aug}
}

@misc{website-vsoao_dataset,
  title = {website Dataset},
  type = {Open Source Dataset},
  author = {ai research},
  howpublished = {\url{https://universe.roboflow.com/ai-research-zk9sn/website-vsoao}},
  journal = {Roboflow Universe},
  publisher = {Roboflow},
  year = {2025},
  month = {aug}
}