arxiv:2508.05899

HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing

Published on Aug 7

Authors:

Abstract

HOLODECK 2.0 generates high-quality, diverse 3D scenes from text descriptions using vision-language models and 3D generative models, with support for interactive editing and procedural game modeling.

AI-generated summary

3D scene generation plays a crucial role in gaming, artistic creation, virtual reality and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. As a result, generating 3D worlds directly from text has garnered increasing attention. In this paper, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. It then iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Human evaluations and CLIP-based assessments demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, we provide editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling, generating visually rich and immersive environments, potentially boosting efficiency.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2508.05899 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2508.05899 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2508.05899 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.