Abstract
Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. A vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks that require varied representations and semantics. This decoupled paradigm introduces a critical misalignment: the loss of the vision tokenization stage can become a representation bottleneck for target tasks. For example, errors in tokenizing text within an image lead to poor results when recognizing or generating that text. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook and optimizes the vision tokenizer end-to-end with both reconstruction and caption objectives. ETT is simple to implement and can be seamlessly integrated into existing training pipelines with minimal architecture modifications, without adjusting the original codebooks or the architectures of the employed large language models. Extensive experiments demonstrate that our end-to-end vision tokenizer tuning unlocks significant performance gains of 2-6% on multimodal understanding and visual generation tasks compared to frozen-tokenizer baselines, while preserving the original reconstruction capability. We hope this simple yet strong method can empower multimodal foundation models beyond image generation and understanding.
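To make the joint objective described in the abstract concrete, here is a minimal PyTorch-style sketch of one ETT-like training step. It assumes a VQ tokenizer exposing `encode`, `quantize`, `codebook`, and `decode`, and an LLM with a visual projector; all module, function, and argument names (`ett_step`, `project_visual`, `lambda_rec`, `lambda_cap`) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ett_step(tokenizer, llm, image, caption_ids, lambda_rec=1.0, lambda_cap=1.0):
    # 1) Encode the image and quantize against the (trainable) codebook.
    z = tokenizer.encode(image)                 # [B, N, D] continuous latents
    codes, vq_loss = tokenizer.quantize(z)      # discrete indices + commitment loss

    # 2) Feed codebook *embeddings* (not just discrete indices) to the LLM through a
    #    projector, so the caption loss back-propagates into the tokenizer codebook.
    vis_embeds = tokenizer.codebook(codes)      # [B, N, D]
    vis_tokens = llm.project_visual(vis_embeds) # [B, N, H]
    cap_logits = llm(visual=vis_tokens, text=caption_ids)   # [B, T, V]
    caption_loss = F.cross_entropy(
        cap_logits[:, :-1].flatten(0, 1), caption_ids[:, 1:].flatten()
    )

    # 3) Keep the reconstruction objective so tokenization quality is preserved.
    recon = tokenizer.decode(vis_embeds)
    recon_loss = F.mse_loss(recon, image)

    return lambda_rec * (recon_loss + vq_loss) + lambda_cap * caption_loss
```

The key point of the sketch is step 2: because the LLM consumes codebook embeddings rather than opaque indices, gradients from the captioning objective reach the tokenizer, while step 3 keeps the tokenizer anchored to reconstruction.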
Community
Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize across various tasks. A vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks that require varied representations and semantics. We propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks, unlocking significant performance gains of 2-6% on multimodal understanding and visual generation tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper. They were recommended by the Semantic Scholar API:
- DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies (2025)
- TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation (2025)
- UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding (2025)
- ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement (2025)
- TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference (2025)
- Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training (2025)
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs (2025)
I would like to ask whether there are other experiments supporting the superiority of end-to-end modeling under this paradigm.
For a 1.5B model, the performance improvement brought by full fine-tuning on specific tasks is predictable.
Therefore, the simple ablation study on tokenizer tuning does not seem sufficient to demonstrate the core contribution described in this paper.