arxiv:2506.17218

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Published on Jun 20
Submitted by XueyangY on Jun 23

Abstract

Mirage enhances vision-language models by integrating latent visual tokens into text decoding, improving multimodal reasoning without generating explicit images.

Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders their reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as the next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.
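
To make the decoding mechanism concrete, here is a minimal sketch (not the authors' released code) of how a causal VLM could interleave latent visual tokens with text during generation: when a hypothetical control token such as `<imagine>` is emitted, the last hidden state is projected and fed back as the next input embedding instead of a discrete token. The names `imagine_id` and `latent_proj` are illustrative assumptions.

```python
import torch

@torch.no_grad()
def decode_with_latent_tokens(model, tokenizer, input_embeds, max_steps=256,
                              imagine_id=None, latent_proj=None):
    """Greedy decoding that interleaves text tokens and latent visual tokens."""
    embeds = input_embeds                               # (1, seq, hidden)
    text_ids = []
    for _ in range(max_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1]      # (1, hidden), last position
        next_id = out.logits[:, -1].argmax(dim=-1)      # greedy next token

        if imagine_id is not None and next_id.item() == imagine_id:
            # "Think visually": recast the hidden state as the next token
            # embedding instead of emitting a pixel-level or text token.
            next_embed = latent_proj(last_hidden).unsqueeze(1)
        else:
            if next_id.item() == tokenizer.eos_token_id:
                break
            text_ids.append(next_id.item())
            next_embed = model.get_input_embeddings()(next_id).unsqueeze(1)

        # Continue the multimodal trajectory with the chosen embedding.
        embeds = torch.cat([embeds, next_embed], dim=1)
    return tokenizer.decode(text_ids)
```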

Community

Paper author and submitter

We investigate whether current VLMs can enhance reasoning by generating visual thoughts. Existing unified models often struggle to produce coherent interleaved reasoning trajectories and require extensive pretraining and modality alignment. To address these challenges, we introduce our Machine Mental Imagery (Mirage) framework that interleaves textual and visual reasoning by generating implicit visual tokens alongside text tokens. Rather than rendering pixel-level images, Mirage chooses to "think visually" by recasting its hidden states as multimodal tokens, enabling seamless progression along a reasoning trajectory. More details are available on our project page and code repo.
project page: https://vlm-mirage.github.io/
code: https://github.com/UMass-Embodied-AGI/Mirage
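
For the first supervision stage described in the abstract, a hedged sketch of what a joint objective could look like is given below: standard cross-entropy on text tokens plus a distillation term pulling each latent token toward a ground-truth image embedding. The cosine-distance choice, the weight `alpha`, and the tensor names are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def stage1_loss(text_logits, text_labels, latent_states, image_embeds, alpha=1.0):
    """Joint text cross-entropy + latent-token distillation loss.

    text_logits:   (B, T, vocab)  logits at text positions
    text_labels:   (B, T)         target token ids (-100 = ignore)
    latent_states: (B, K, D)      hidden states emitted at latent positions
    image_embeds:  (B, K, D)      ground-truth image embeddings (e.g. a frozen encoder)
    """
    ce = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(),
                         ignore_index=-100)
    # Distill each latent token toward its target embedding (1 - cosine similarity).
    distill = 1.0 - F.cosine_similarity(latent_states, image_embeds, dim=-1).mean()
    return ce + alpha * distill
```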

Models citing this paper 0

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 3