arxiv:2507.06543

Token Bottleneck: One Token to Remember Dynamics

Published on Jul 9 · Submitted by bhheo on Jul 11
Authors:

Abstract

AI-generated summary: ToBo is a self-supervised learning method that creates compact, temporally aware visual representations for sequential scene understanding tasks, outperforming baselines in both simulated and real-world environments.

Deriving compact and temporally aware visual representations from dynamic scenes is essential for successful execution of sequential scene understanding tasks such as visual tracking and robotic manipulation. In this paper, we introduce Token Bottleneck (ToBo), a simple yet intuitive self-supervised learning pipeline that squeezes a scene into a bottleneck token and predicts the subsequent scene using minimal patches as hints. The ToBo pipeline facilitates the learning of sequential scene representations by conservatively encoding the reference scene into a compact bottleneck token during the squeeze step. In the expansion step, we guide the model to capture temporal dynamics by predicting the target scene using the bottleneck token along with a few target patches as hints. This design encourages the vision backbone to embed temporal dependencies, thereby enabling understanding of dynamic transitions across scenes. Extensive experiments on diverse sequential tasks, including video label propagation and robot manipulation in simulated environments, demonstrate the superiority of ToBo over baselines. Moreover, deploying our pre-trained model on physical robots confirms its robustness and effectiveness in real-world environments. We further validate the scalability of ToBo across different model scales.
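
To make the two-step pipeline concrete, here is a minimal, hypothetical PyTorch sketch of the squeeze and expansion steps. The generic encoder/decoder modules, the learnable bottleneck and mask tokens, the patch-embedding inputs, and the MSE reconstruction target are assumptions for illustration only; the paper's exact architecture, masking ratio, and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToBoSketch(nn.Module):
    """Minimal sketch of the squeeze-and-expand idea described above.

    Assumptions (not from the paper): inputs are already patch embeddings of
    dimension `dim`, the encoder/decoder are generic token-to-token modules
    (e.g. Transformer blocks), and the loss is MSE against target embeddings.
    """

    def __init__(self, encoder: nn.Module, decoder: nn.Module, dim: int):
        super().__init__()
        self.encoder = encoder                                   # shared vision backbone
        self.decoder = decoder                                   # lightweight prediction head
        self.bottleneck = nn.Parameter(torch.zeros(1, 1, dim))   # learnable bottleneck token
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # stands in for unseen target patches

    def forward(self, ref_tokens, tgt_tokens, hint_idx):
        # ref_tokens, tgt_tokens: (B, N, dim) patch embeddings of reference / target scenes
        # hint_idx: (B, K) indices of the few visible target patches used as hints
        B, N, D = tgt_tokens.shape

        # Squeeze step: encode the reference scene together with the bottleneck
        # token and keep only the bottleneck output as the compact scene summary.
        bn = self.bottleneck.expand(B, -1, -1)
        ref_out = self.encoder(torch.cat([bn, ref_tokens], dim=1))
        scene_token = ref_out[:, :1]                             # (B, 1, dim)

        # Expansion step: reconstruct the target scene from the scene summary
        # plus the hint patches; every other position starts as a mask token.
        dec_in = self.mask_token.expand(B, N, -1).clone()
        idx = hint_idx.unsqueeze(-1).expand(-1, -1, D)
        dec_in.scatter_(1, idx, torch.gather(tgt_tokens, 1, idx))
        pred = self.decoder(torch.cat([scene_token, dec_in], dim=1))[:, 1:]

        # Reconstruction loss on the full target scene (loss target is an assumption).
        return F.mse_loss(pred, tgt_tokens)
```

In training, consecutive frames from a video or robot trajectory would serve as the reference and target scenes, with only a small fraction of target patches exposed as hints, so the bottleneck token is forced to carry the temporal information needed to predict the rest.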

Community

Paper author · Paper submitter

Impressive performance on robotics!

Paper author • edited 1 day ago

We introduce Token Bottleneck, which enables both conservative summarization of an observed scene into a bottleneck token and understanding of temporal dynamics from bottleneck tokens across consecutive scenes. In addition, we provide real-world demonstrations on our project page.

