Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Jiwan Chung*  Junhyeok Kim*  Siyeol Kim  Jaeyoung Lee  Minsoo Kim  Youngjae Yu

arXiv HuggingFace

Installation

conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Demo

Gradio Web UI

Highly Recommended as the copy tokens are displayed on image.

python run_gradio.py

Inference

python inference.py

The script uses a default image URL and text prompt. To use your own inputs, you can modify the image variable within the messages list and the text field for the user prompt.

Coming Soon

  • Inference code
  • Training data
  • Evaluation code
  • Training code

Citation

If you find our work valuable, please cite:

@misc{chung2025dontlookoncemultimodal,
      title={Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation}, 
      author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
      year={2025},
      eprint={2505.18842},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18842}, 
}
Downloads last month
11
Safetensors
Model size
8.32B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support