arxiv:2507.04559

MambaVideo for Discrete Video Tokenization with Channel-Split Quantization

Published on Jul 6

Authors:

Abstract

A novel Mamba-based encoder-decoder architecture and channel-split quantization scheme improve discrete video tokenization for autoregressive generative modeling.

AI-generated summary

Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequencebased tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolutionbased and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.04559 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.04559 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.04559 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.