MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
Abstract
MesaTask is an LLM-based framework that uses a Spatial Reasoning Chain, refined with DPO, to generate realistic tabletop scenes aligned with task descriptions.
For robots to interpret human instructions and execute manipulation tasks, they need task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in plausibility or task alignment. In this paper, we formulate a novel task, task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and concrete tabletop scenes. To support research on this challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realism and capture intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. Building on this reasoning chain, we present MesaTask, an LLM-based framework further enhanced with Direct Preference Optimization (DPO) to generate physically plausible tabletop scenes that align well with given task descriptions. Extensive experiments demonstrate that MesaTask outperforms baselines in generating task-conforming tabletop scenes with realistic layouts. The project page is at https://mesatask.github.io/
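To make the three stages of the Spatial Reasoning Chain concrete, the sketch below models the intermediate representations such a pipeline might pass between stages. This is a minimal illustration, not the authors' actual implementation: all class names, fields, and the toy `resolve_layout` function are assumptions added here for exposition; the abstract specifies only the stage decomposition (object inference → spatial interrelation reasoning → scene graph → 3D layout).

```python
# Illustrative sketch of the Spatial Reasoning Chain's intermediate
# representations. Names and structures are hypothetical; the paper
# specifies only the three-stage decomposition.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """Stage 1 (object inference): an object the task implies is needed."""
    name: str       # e.g. "bowl"
    category: str   # e.g. "container"

@dataclass
class SpatialRelation:
    """Stage 2 (spatial interrelation reasoning): a pairwise relation."""
    subject: str    # name of the placed object
    relation: str   # e.g. "on_top_of", "left_of", "inside"
    reference: str  # name of the object it is placed relative to

@dataclass
class SceneGraph:
    """Stage 3 (scene graph construction): objects plus relations,
    ready to be resolved into concrete 3D poses."""
    objects: list[SceneObject] = field(default_factory=list)
    relations: list[SpatialRelation] = field(default_factory=list)

def resolve_layout(graph: SceneGraph) -> dict[str, tuple[float, float, float]]:
    """Toy layout resolver: assigns placeholder (x, y, z) tabletop
    coordinates by walking the relations. A real system would use the
    LLM's reasoning output plus collision/physics checks instead."""
    positions: dict[str, tuple[float, float, float]] = {}
    spacing = 0.3  # meters between unrelated objects (arbitrary choice)
    for i, obj in enumerate(graph.objects):
        positions[obj.name] = (i * spacing, 0.0, 0.0)
    for rel in graph.relations:
        if rel.relation == "on_top_of" and rel.reference in positions:
            x, y, _ = positions[rel.reference]
            positions[rel.subject] = (x, y, 0.1)  # stack above the reference
    return positions

# Example: a task like "put the fruit in the bowl" might imply this graph.
graph = SceneGraph(
    objects=[SceneObject("bowl", "container"), SceneObject("apple", "fruit")],
    relations=[SpatialRelation("apple", "on_top_of", "bowl")],
)
print(resolve_layout(graph))
```

The value of the staged decomposition is that each intermediate output (object list, relation set, scene graph) is a structured target an LLM can be trained and preference-optimized against, rather than asking it to emit raw 3D coordinates in one step.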
Community
🤖 Robots that actually understand “put the fruit in the bowl”?
Meet MesaTask 🚀 [NeurIPS 2025 Spotlight]
✨ 10K+ physics-verified tabletop scenes
✨ 12K+ curated 3D assets
✨ Outperforms baselines in alignment, realism & physicality
All data & code are OPEN — try it now ⚡
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation (2025)
- RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph (2025)
- InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts (2025)
- HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing (2025)
- Causal Reasoning Elicits Controllable 3D Scene Generation (2025)
- SAGE: Scene Graph-Aware Guidance and Execution for Long-Horizon Manipulation Tasks (2025)
- Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping (2025)