
🤗 Hugging Face
GitHub
Introduction
We are pleased to introduce the first model checkpoint of the Assistant-Zero project, an early milestone in our development of general-purpose visual agents for on-device personal assistants. This release validates the core components of our end-to-end training pipeline, including data generation, supervised fine-tuning (SFT), and the infrastructure for reinforcement learning (RL).
Project Motivation
As large language models (LLMs) continue to evolve, agent-based systems have emerged as a highly promising application direction. INF has already deployed several agent products; however, these systems are limited by:
- Restricted information access: current agents retrieve data only from the web, static files, or user prompts. This prevents them from adapting flexibly to the user's real-time needs.
- Limited external interaction: most agents rely on pre-defined APIs or manually integrated tools for action execution, which lack scalability across applications.
While these limitations may not be prominent in business-oriented applications, where tasks are often structured and environments are pre-integrated, they become critical obstacles when building consumer-facing products. In real-world usage, a general-purpose personal assistant must operate across diverse and dynamic app environments, respond to highly personalized user needs, and interact through interfaces that are not explicitly designed for automation. Without flexible access to information and scalable interaction mechanisms, current agents struggle to support tasks such as booking hotels during a trip, comparing products across platforms, or assisting with spontaneous user decisions. These are precisely the scenarios where true personal assistants must excel.
To address these challenges, we focus on building agents that operate directly on the smartphone interface, a platform that reflects how users naturally interact with digital environments. These visual agents observe screen content visually and interact via taps and text, similar to human users. This shift enables broader application control without task-specific engineering.
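To make this concrete, below is a minimal sketch of what a screen-level action space for such a visual agent could look like. The class names, fields, and the trivial `act` policy are hypothetical illustrations, not Assistant-Zero's actual interface.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical screen-level actions; the real action schema may differ.
@dataclass
class Tap:
    x: int  # pixel coordinates on the rendered screenshot
    y: int

@dataclass
class TypeText:
    text: str  # text typed into the currently focused field

@dataclass
class Swipe:
    x0: int
    y0: int
    x1: int
    y1: int

Action = Union[Tap, TypeText, Swipe]

def act(screenshot_png: bytes, instruction: str) -> Action:
    """Policy interface: observe the screen visually, emit one UI action.

    A real policy would run a vision-language model over the screenshot
    and instruction; we return a fixed tap to keep the sketch runnable.
    """
    return Tap(x=540, y=960)

if __name__ == "__main__":
    print(act(b"", "Open the settings app"))
```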
System Overview
The released model is trained with our complete data pipeline and is used to evaluate our newly developed platform, INF-Aspire (INF Asynchronous Reinforcement Learning Platform). Specifically:
- We have restructured a full Android-based simulation and evaluation environment around the Model Context Protocol (MCP), enabling interactive, agent-centric reinforcement learning (a rough sketch of such an MCP environment follows this list).
- The first model checkpoint has been trained with supervised fine-tuning on simulation data and is designed for tasks in the Android World and Android Control environments.
- This checkpoint serves as:
- A validation of our data production pipeline and augmentation strategies.
- A testbed for our supervised fine-tuning (SFT) methodology.
- An initialization for future reinforcement learning (RL) experiments in simulation environments using the INF-Aspire framework.
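As a rough illustration of what an MCP-backed Android environment can look like, the sketch below exposes a few device controls as MCP tools. It assumes the official `mcp` Python SDK and an emulator reachable via `adb`; the server name and tool set are hypothetical, not the project's actual environment.

```python
# Hypothetical MCP server wrapping an Android emulator as agent tools.
# Assumes `pip install "mcp[cli]"` and a device visible to `adb`.
import base64
import subprocess

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("android-env")  # illustrative server name

def adb(*args: str) -> bytes:
    """Run an adb command against the connected device/emulator."""
    return subprocess.run(["adb", *args], capture_output=True, check=True).stdout

@mcp.tool()
def screenshot() -> str:
    """Capture the current screen; returns a base64-encoded PNG."""
    return base64.b64encode(adb("exec-out", "screencap", "-p")).decode()

@mcp.tool()
def tap(x: int, y: int) -> str:
    """Tap the screen at pixel coordinates (x, y)."""
    adb("shell", "input", "tap", str(x), str(y))
    return "ok"

@mcp.tool()
def type_text(text: str) -> str:
    """Type text into the currently focused field."""
    adb("shell", "input", "text", text.replace(" ", "%s"))  # adb escapes spaces as %s
    return "ok"

if __name__ == "__main__":
    mcp.run()
```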
Experiments
| Model | Android Control | Android World |
|---|---|---|
| Qwen2.5-VL-7B | 60.1 | - |
| Qwen2.5-VL-72B | 67.4 | 35.0 |
| UI-TARS-7B | 72.5 | 33.0 |
| UI-TARS-72B | 74.7 | 46.6 |
| INF-AZ-7B-0524 | 69.5 | 47.0 |
Future Plan: Reinforcement Learning with INF-Aspire
The next phase of the Assistant-Zero project will focus on scaling reinforcement learning using the INF-Aspire framework. This phase is critical for enabling the agent to move beyond SFT and acquire robust, adaptive behavior through trial and feedback.
Key directions include:
- Reward design and evaluation: We will develop reward functions, with particular focus on process rewards that reflect real-world task objectives, and use the current checkpoint as a baseline for performance tracking.
- Curriculum learning: Progressive task structuring will be used to gradually increase difficulty, allowing the agent to build and refine capabilities over time.
- Asynchronous architecture for RL: INF-Aspire enables scalable training across distributed environments by decoupling data collection from policy updates, significantly improving efficiency (see the sketch after this list).
- Simulation Augmentation: Continued refinement and augmentation of the Android-based simulation environment will ensure that reinforcement learning is grounded in realistic interaction dynamics.
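To illustrate the decoupling mentioned above, here is a minimal, self-contained sketch of the asynchronous pattern: rollout workers keep filling a trajectory queue while a learner consumes batches at its own pace. It demonstrates the idea only; it is not the INF-Aspire implementation, and all names are illustrative.

```python
import queue
import random
import threading

# Bounded queue decouples rollout collection from policy updates.
trajectories: queue.Queue = queue.Queue(maxsize=64)
policy_version = 0  # stands in for the learner's latest weights

def rollout_worker(worker_id: int) -> None:
    """Collect episodes with a (possibly stale) policy snapshot."""
    for _ in range(5):
        traj = {
            "worker": worker_id,
            "policy_version": policy_version,  # may lag the learner
            "reward": random.random(),         # placeholder episode return
        }
        trajectories.put(traj)  # blocks only if the learner falls behind

def learner(num_updates: int) -> None:
    """Consume trajectory batches and update the policy at its own pace."""
    global policy_version
    for _ in range(num_updates):
        batch = [trajectories.get() for _ in range(4)]
        # A real learner would run a policy-gradient update here.
        policy_version += 1
        mean_reward = sum(t["reward"] for t in batch) / len(batch)
        print(f"update {policy_version}: mean reward {mean_reward:.3f}")

workers = [threading.Thread(target=rollout_worker, args=(i,)) for i in range(4)]
for w in workers:
    w.start()
learner(num_updates=5)  # 4 workers x 5 episodes == 5 updates x 4 samples
for w in workers:
    w.join()
```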
Contributors
Assistant-Zero team
The Assistant-Zero project is currently supported by several researchers and engineers who share a passion for the ultimate on-device personal intelligent assistant. The work presented here is not the result of any commercial mandate, but rather a reflection of our collective curiosity and technical exploration.
All contributors have contributed equally to the development and progress of Assistant-Zero. Contributions are listed in no particular order.
- 👨‍💻 Tan Xiaoyu: Project Lead, Agent Design, Training
- 👨‍💻 Qu Chao: Algorithm Design, Training
- 👨‍💻 Hao Jiaran: Evaluation, Agent Design, Simulation
- 👨‍💻 Yao Tianchu: Data Pipeline, Training
- 👨‍💻 Lu Dakuan: Evaluation
- 👨‍💻 Hu Hongqing: Model Quantization, Deployment
- 👨‍💻 Songliu Yihan: Training
- 👨‍💻 Wei Lingfeng: Inference, Android Simulation
- 👨‍💻 Ai Xi: Technical Support
License Agreement
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license; see the License for the additional terms.
Contact
Xiaoyu Tan: [email protected]
Citation
If you find our work helpful, feel free to cite us:
@misc{inftech_pi_zero2024,
author = {Assistant-Zero Team},
title = {Assistant-Zero: Pioneering the Future of Personal AI Assistants},
year = {2025},
url = {https://github.com/WilliamBUG/Assistant-Zero/blob/main/README.md},
note = {Accessed: 2025-05-24}
}