Abstract
We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy Models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multistep roll-outs. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at https://github.com/manantomar/video-occupancy-models.
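The single-step discounted prediction can be made concrete with a small sketch: if the horizon is sampled as k ~ Geometric(1 − γ), then the latent k steps ahead is a draw from the γ-discounted future-state (occupancy) distribution, so a conditional model trained on such (z_t, z_{t+k}) pairs targets that distribution in one forward pass, with no roll-out. The code below is an illustrative sketch under stated assumptions, not the paper's implementation: `VOCHead`, `sample_horizons`, the frozen-encoder assumption, and the MSE regression standing in for a proper generative head are all ours.

```python
# Illustrative sketch only (not the authors' code): train a head on latent
# pairs (z_t, z_{t+k}) with k ~ Geometric(1 - gamma), so each target is a
# sample from the discounted future-state (occupancy) distribution.
import torch
import torch.nn as nn


class VOCHead(nn.Module):
    """Hypothetical one-step predictor over latents: given z_t, it outputs
    a prediction about future latents with no multistep roll-out. A real
    model would use a generative head; MSE below is a simplification."""

    def __init__(self, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        return self.net(z_t)


def sample_horizons(batch: int, gamma: float, max_k: int) -> torch.Tensor:
    """Sample k ~ Geometric(1 - gamma), shifted so k >= 1: visiting the
    state k steps ahead under this horizon law is equivalent to sampling
    from the gamma-discounted occupancy distribution."""
    k = torch.distributions.Geometric(probs=1.0 - gamma).sample((batch,)) + 1
    return k.long().clamp(max=max_k)


def voc_loss(head: VOCHead, latents: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """latents: (B, T, D) frames encoded by an assumed frozen encoder;
    working in this compact latent space avoids pixel-level prediction."""
    B, T, _ = latents.shape
    k = sample_horizons(B, gamma, max_k=T - 1)
    rows = torch.arange(B)
    z_t = latents[:, 0]            # current latent
    z_future = latents[rows, k]    # one draw per row from the discounted distribution
    return ((head(z_t) - z_future) ** 2).mean()


if __name__ == "__main__":
    latents = torch.randn(8, 32, 64)  # fake (B, T, D) latents for a smoke test
    print(voc_loss(VOCHead(latent_dim=64), latents).item())
```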
Community
The following papers were recommended by the Semantic Scholar API:
- Learning to Play Atari in a World of Tokens (2024)
- Video Prediction Models as General Visual Encoders (2024)
- Visual Representation Learning with Stochastic Frame Prediction (2024)
- iVideoGPT: Interactive VideoGPTs are Scalable World Models (2024)
- Efficient World Models with Context-Aware Tokenization (2024)