arxiv:2502.15021

Simpler Fast Vision Transformers with a Jumbo CLS Token

Published on Feb 20, 2025

AI-generated summary

Jumbo enhances vision transformers by expanding and processing the CLS token with self-attention and a wider FFN, improving accuracy and efficiency on ImageNet and time series tasks.

Abstract

We introduce a simple enhancement of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Since there is only one Jumbo token, its cost is minimal, and because we share this FFN across layers, its parameter count is controlled. Jumbo significantly improves over ViT+Registers on ImageNet-1K and ImageNet-21K. These gains are largest at small sizes / high speeds, e.g., ViT-nano+Jumbo outperforms ViT-nano+Registers by 13%. In fact, our Jumbo models are so efficient that they outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs, such as support for token dropping and other modalities. Accordingly, we demonstrate that Jumbo excels in these two settings via masked autoencoding and on a suite of time series benchmarks. Code and weights available: https://github.com/antofuller/jumbo
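To make the mechanism concrete, below is a minimal PyTorch sketch of one Jumbo transformer block, reconstructed only from the abstract above. All names (JumboBlock, jumbo_mult, shared_jumbo_ffn) and shape choices are illustrative assumptions, not the authors' released implementation; see the linked repository for the real code.

```python
# Sketch of a Jumbo block, assuming: the wide CLS ("Jumbo") token is split
# into `jumbo_mult` patch-width tokens for attention, reassembled afterward,
# and passed through a dedicated wider FFN that is shared across all layers.
import torch
import torch.nn as nn


class JumboBlock(nn.Module):
    def __init__(self, dim, jumbo_mult, num_heads, shared_jumbo_ffn):
        super().__init__()
        self.jumbo_mult = jumbo_mult
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Standard FFN for the patch tokens.
        self.patch_ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Wider FFN applied only to the reassembled Jumbo token; a single
        # instance is shared across blocks to control parameter count.
        self.jumbo_ffn = shared_jumbo_ffn
        self.norm_jumbo = nn.LayerNorm(dim * jumbo_mult)

    def forward(self, jumbo, patches):
        # jumbo: (B, dim * jumbo_mult), patches: (B, N, dim)
        B, N, D = patches.shape
        # Split the wide CLS token to patch width so it can join attention.
        jumbo_split = jumbo.view(B, self.jumbo_mult, D)
        x = torch.cat([jumbo_split, patches], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Reassemble the Jumbo token and route it through the wider FFN.
        jumbo = x[:, : self.jumbo_mult].reshape(B, -1)
        jumbo = jumbo + self.jumbo_ffn(self.norm_jumbo(jumbo))
        patches = x[:, self.jumbo_mult :]
        patches = patches + self.patch_ffn(self.norm2(patches))
        return jumbo, patches


# Illustrative usage: a small ViT width with a 4x-wide Jumbo token.
dim, mult = 192, 4
shared_ffn = nn.Sequential(
    nn.Linear(dim * mult, 4 * dim * mult),
    nn.GELU(),
    nn.Linear(4 * dim * mult, dim * mult),
)
block = JumboBlock(dim, mult, num_heads=3, shared_jumbo_ffn=shared_ffn)
jumbo, patches = block(torch.randn(2, dim * mult), torch.randn(2, 196, dim))
```

Because only one token takes the wider path, the extra attention cost is a handful of additional tokens per layer, and sharing shared_ffn across blocks keeps the parameter overhead bounded, consistent with the abstract's efficiency claims.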

