Abstract
OceanGym is a benchmark for underwater embodied agents driven by Multi-modal Large Language Models, targeting the challenges of perception, planning, and adaptability in harsh ocean environments.
We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, which make effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
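As a concrete illustration of the perception-memory-decision loop such a framework implies, here is a minimal Python sketch. All names here (`Observation`, `Memory`, `query_mllm`, the action vocabulary) are hypothetical illustrations, not OceanGym's actual API; the MLLM call is stubbed with a placeholder policy so the example runs offline.

```python
# Minimal sketch of an MLLM-driven perceive-remember-decide loop, in the
# spirit of the framework described above. Hypothetical names throughout;
# this is not the benchmark's real interface.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Observation:
    optical: bytes                  # RGB camera frame (e.g., encoded JPEG)
    sonar: bytes                    # sonar return image
    position: Tuple[float, float, float]  # (x, y, depth) estimate

@dataclass
class Memory:
    events: List[str] = field(default_factory=list)
    max_events: int = 50            # bound memory so prompts stay short

    def add(self, summary: str) -> None:
        self.events.append(summary)
        self.events = self.events[-self.max_events:]

def query_mllm(task: str, obs: Observation, memory: Memory) -> str:
    """Stand-in for a real multi-modal LLM call. In practice this would
    send the optical/sonar images plus the prompt below to a model
    endpoint and parse an action string from its reply."""
    prompt = (
        f"Task: {task}\n"
        f"Recent memory: {memory.events[-5:]}\n"
        f"Position: {obs.position}\n"
        "Choose one action: forward, turn_left, turn_right, ascend, descend, scan."
    )
    # Placeholder policy so the sketch runs without a model endpoint.
    return "scan" if not memory.events else "forward"

def run_episode(task: str, steps: int = 3) -> None:
    memory = Memory()
    for t in range(steps):
        # A real environment would return fresh sensor frames each step.
        obs = Observation(optical=b"", sonar=b"", position=(0.0, 0.0, -12.0))
        action = query_mllm(task, obs, memory)
        memory.add(f"step {t}: took '{action}' at {obs.position}")
        print(f"step {t}: {action}")

if __name__ == "__main__":
    run_episode("Locate the pipeline valve and report its position.")
```

Bounding the memory buffer is one plausible way to keep long-horizon context within the model's prompt budget; the paper's actual memory mechanism may differ.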
Community
Similar papers recommended by the Semantic Scholar API (via Librarian Bot):
- UnderwaterVLA: Dual-brain Vision-Language-Action architecture for Autonomous Underwater Navigation (2025)
- DREAM: Domain-aware Reasoning for Efficient Autonomous Underwater Monitoring (2025)
- AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation (2025)
- Sight Over Site: Perception-Aware Reinforcement Learning for Efficient Robotic Inspection (2025)
- PhysiAgent: An Embodied Agent Framework in Physical World (2025)
- DUViN: Diffusion-Based Underwater Visual Navigation via Knowledge-Transferred Depth Features (2025)
- Shared Autonomy through LLMs and Reinforcement Learning for Applications to Ship Hull Inspections (2025)
Models citing this paper: 0
Datasets citing this paper: 1
Spaces citing this paper: 0