It now includes:
- A live stream of the progress being made on the task (see the included video)
- The following components (a minimal sketch of how they might fit together follows below):
  1. Automatic prompt optimization
  2. An orchestrator that dynamically decides which agent to call, incorporating feedback from a human (human-in-the-loop)
  3. A coding agent to complete the task
  4. A code-reviewing agent that iteratively provides feedback to improve the code generated by the coding agent until it meets the required criteria and is approved
  5. A testing agent that tests the approved code or provides information on how to test it
  6. A documentation agent that produces documentation and a help message for the approved and tested code
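As a rough illustration only, here is a minimal Python sketch of how such a pipeline could be wired together; every name in it (coding_agent, review_agent, testing_agent, documentation_agent, orchestrate) is a hypothetical placeholder, not the project's actual API.

```python
# Minimal sketch of the agent pipeline described above.
# All agent functions are stubs standing in for LLM calls.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Review:
    approved: bool
    feedback: str

def coding_agent(task: str, feedback: str = "") -> str:
    """Stub: would call an LLM to produce code for the task."""
    return f"# code for: {task}\n# incorporating feedback: {feedback}"

def review_agent(code: str) -> Review:
    """Stub: would call an LLM to critique the code against the criteria."""
    return Review(approved=True, feedback="Looks good.")

def testing_agent(code: str) -> str:
    """Stub: would run the approved code's tests or describe how to test it."""
    return "All tests passed (stub)."

def documentation_agent(code: str) -> str:
    """Stub: would generate documentation and a help message."""
    return "Usage: run the script with --help for options (stub)."

def orchestrate(task: str, human_feedback: Callable[[str], str], max_rounds: int = 3) -> dict:
    """Route the task through a coder/reviewer loop, then testing and docs."""
    code, review = "", Review(approved=False, feedback="")
    for _ in range(max_rounds):
        code = coding_agent(task, review.feedback)
        review = review_agent(code)
        if review.approved:
            break
        # Human-in-the-loop: let a person steer the next iteration.
        review.feedback += "\n" + human_feedback(code)
    return {
        "code": code,
        "tests": testing_agent(code),
        "docs": documentation_agent(code),
    }

if __name__ == "__main__":
    result = orchestrate("parse a CSV file", human_feedback=lambda code: "prefer the csv module")
    print(result["docs"])
```

In this sketch the reviewer loop bounds the number of revision rounds, and the human feedback hook is only consulted when the reviewer rejects the code.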
A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.
They show that fine-tuning small open-source models can significantly boost performance, matching that of much bigger closed models like GPT-4o.
The team built:
- A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically
- A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces (see the sketch after this list)
- An instruction dataset of 10.5k operation traces for training mobile agents
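To make the dual-interface idea concrete, here is a minimal sketch of an environment that can expose either a text-only XML view or a marked screenshot to the agent; the class and method names (MobileEnv, observe, step) are hypothetical illustrations, not AndroidLab's actual API.

```python
# Sketch of a dual-mode observation interface for a mobile agent.
# Hypothetical names; not AndroidLab's real classes or methods.
from dataclasses import dataclass

@dataclass
class Observation:
    mode: str            # "xml" or "screenshot"
    payload: bytes | str

class MobileEnv:
    def __init__(self, mode: str = "xml"):
        if mode not in {"xml", "screenshot"}:
            raise ValueError("mode must be 'xml' or 'screenshot'")
        self.mode = mode

    def observe(self) -> Observation:
        if self.mode == "xml":
            # Text-only agents would receive the UI hierarchy as XML,
            # e.g. something like the output of `adb shell uiautomator dump`.
            return Observation("xml", "<hierarchy>...</hierarchy>")
        # Multimodal agents would receive a screenshot with interactive
        # elements visually marked so the model can ground its actions.
        return Observation("screenshot", b"\x89PNG...")

    def step(self, action: str) -> Observation:
        # The agent emits an action such as 'tap(12)' or 'type("hello")';
        # the environment would execute it on the device and return the
        # next observation. Here we just return a fresh observation.
        return self.observe()

# Example text-only loop: fetch the XML view, let an LLM pick an action,
# then step the environment with that action.
env = MobileEnv(mode="xml")
obs = env.observe()
obs = env.step('tap(3)')
```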
Key insights:
- Fine-tuning improves performance by a lot: the open-source Llama-3.1-8B goes from a 2% to a 24% success rate after training, nearly reaching GPT-4o performance despite being much smaller.
- Text-only agents match multimodal ones: XML-based agents achieve performance similar to screenshot-based multimodal agents.