arXiv:2507.09144

I^{2}-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

Published on Jul 12, 2025

AI-generated summary

I^{2}-World, an encoder-decoder framework, efficiently tokenizes and forecasts 4D occupancy in 3D scenes, achieving state-of-the-art performance with high computational efficiency.

Abstract

Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose I^{2}-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike decoder-only GPT-style autoregressive models, I^{2}-World adopts an encoder-decoder architecture: the encoder aggregates spatial context from the current scene and predicts a transformation matrix to enable high-level control over scene generation, while the decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that I^{2}-World achieves state-of-the-art performance, outperforming existing methods by 25.1% in mIoU and 36.9% in IoU for 4D occupancy forecasting while exhibiting exceptional computational efficiency: it requires only 2.9 GB of training memory and achieves real-time inference at 37.0 FPS. Our code is available at https://github.com/lzzzzzm/II-World.
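
The multi-scale residual quantization of the intra-scene tokenizer can be pictured with a short sketch. The code below is a hypothetical illustration based only on the abstract, not the paper's implementation: the class and parameter names (MultiScaleResidualQuantizer, codebook_size, scales) are assumptions, gradient handling (e.g. a straight-through estimator) is omitted, and the real model tokenizes 3D occupancy features rather than this toy 2D feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleResidualQuantizer(nn.Module):
    """Hypothetical sketch of multi-scale residual quantization:
    each scale encodes the residual left unexplained by the coarser
    scales, so a small number of tokens captures most of the scene
    while finer scales add spatial detail."""

    def __init__(self, dim=128, codebook_size=512, scales=(4, 8, 16)):
        super().__init__()
        self.scales = scales
        self.codebook = nn.Embedding(codebook_size, dim)

    def quantize(self, x):
        # x: (B, N, dim); pick the nearest codebook entry per token
        dists = torch.cdist(x, self.codebook.weight.unsqueeze(0))  # (B, N, K)
        idx = dists.argmin(dim=-1)                                 # (B, N)
        return self.codebook(idx), idx

    def forward(self, feat):
        # feat: (B, dim, H, W) feature map of the current scene
        B, C, H, W = feat.shape
        recon = torch.zeros_like(feat)
        tokens = []
        for s in self.scales:
            residual = feat - recon                     # still-unexplained detail
            coarse = F.adaptive_avg_pool2d(residual, s) # compress to s x s grid
            flat = coarse.flatten(2).transpose(1, 2)    # (B, s*s, C)
            q, idx = self.quantize(flat)
            q = q.transpose(1, 2).reshape(B, C, s, s)
            recon = recon + F.interpolate(q, size=(H, W), mode="nearest")
            tokens.append(idx)                          # s*s token ids per scale
        return recon, tokens

# Toy usage: a 64x64 map is summarized by 16 + 64 + 256 tokens
vq = MultiScaleResidualQuantizer()
recon, tokens = vq(torch.randn(2, 128, 64, 64))
```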
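The encoder-decoder forecasting loop can be sketched in the same spirit. Again hypothetical: the module names (IIWorldRollout, to_transform, embed_transform) and the exact way the predicted 4x4 ego transform conditions the decoder are assumptions drawn from the abstract's description, and the history tokens are assumed to share the layout of the current scene so they can be aggregated residually.

```python
import torch
import torch.nn as nn

class IIWorldRollout(nn.Module):
    """Hypothetical sketch of the encoder-decoder rollout: the encoder
    reads the current scene tokens and predicts a transformation
    matrix; the decoder, conditioned on that matrix and residually
    aggregated history tokens, emits the next scene's tokens."""

    def __init__(self, dim=128, nhead=8, layers=2):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead, batch_first=True), layers)
        self.to_transform = nn.Linear(dim, 16)     # flattened 4x4 ego transform
        self.embed_transform = nn.Linear(16, dim)  # re-embed as conditioning

    def forward(self, cur, history, steps=4):
        # cur, history: (B, N, dim) scene tokens with matching layouts
        preds = []
        for _ in range(steps):
            ctx = self.encoder(cur)                           # spatial context
            T = self.to_transform(ctx.mean(dim=1))            # (B, 16) control signal
            cond = history + self.embed_transform(T).unsqueeze(1)
            nxt = self.decoder(cur, cond)                     # next-scene tokens
            history = history + nxt                           # residual temporal aggregation
            preds.append(nxt)
            cur = nxt
        return torch.stack(preds, dim=1), history             # (B, steps, N, dim)
```

In this reading, the predicted transformation matrix is the "high-level control" knob the abstract mentions; a decoder-only GPT-style model would instead fold all of this into a single autoregressive token stream.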
