Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation
Jiwan Chung*β Junhyeok Kim*β Siyeol Kimβ Jaeyoung Leeβ Minsoo Kimβ Youngjae Yu
Installation
conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
Demo
Gradio Web UI
Highly Recommended as the copy tokens are displayed on image.
python run_gradio.py
Inference
python inference.py
The script uses a default image URL and text prompt. To use your own inputs, you can modify the image
variable within the messages
list and the text
field for the user prompt.
Coming Soon
- Inference code
- Training data
- Evaluation code
- Training code
Citation
If you find our work valuable, please cite:
@misc{chung2025dontlookoncemultimodal,
title={Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation},
author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
year={2025},
eprint={2505.18842},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.18842},
}
- Downloads last month
- 11
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
π
Ask for provider support