Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models
Multi-modal keyphrase prediction (MMKP) aims to advance beyond text-only methods by incorporating multiple input modalities to produce a set of conclusive phrases. Traditional multi-modal approaches have been shown to have significant limitations in handling the challenging absent and unseen scenarios. Additionally, we identify shortcomings in existing benchmarks, which overestimate model capability because of significant overlap between the training and test sets. In this work, we propose leveraging vision-language models (VLMs) for the MMKP task. First, we use two widely used strategies, zero-shot and supervised fine-tuning (SFT), to assess the lower-bound performance of VLMs. Next, to improve the complex reasoning capabilities of VLMs, we adopt Fine-tune-CoT, which uses high-quality chain-of-thought (CoT) reasoning data generated by a teacher model to fine-tune smaller models. Finally, to address the "overthinking" phenomenon, we propose a dynamic CoT strategy that adaptively injects CoT data during training, allowing the model to flexibly leverage its reasoning capabilities at inference time. We evaluate the proposed strategies on various datasets, and the experimental results demonstrate the effectiveness of the proposed approaches.
Our open-source 3B model (based on Qwen2.5-VL-3B-Instruct) achieves a state-of-the-art result (61.90 on MMKP).
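The absent and unseen scenarios, and the train/test overlap concern, can be made concrete with a small sketch. It assumes the usual keyphrase-prediction reading of these terms: a keyphrase is "absent" if it does not occur verbatim in the input text, and "unseen" if it never occurs among the training keyphrases. The function names and the plain substring matching are illustrative assumptions, not part of this repository, and the image modality is ignored here.

```python
def categorize_keyphrases(source_text, gold_keyphrases, training_keyphrases):
    """Tag each gold keyphrase as absent/present and unseen/seen.

    Assumption: "absent" means not found verbatim in the source text,
    "unseen" means not found among the training-set keyphrases.
    """
    text = source_text.lower()
    seen_vocab = {kp.lower() for kp in training_keyphrases}
    tags = {}
    for kp in gold_keyphrases:
        norm = kp.lower()
        tags[kp] = {
            "absent": norm not in text,        # simple substring check
            "unseen": norm not in seen_vocab,  # exact match after lowercasing
        }
    return tags


def seen_ratio(test_keyphrases, training_keyphrases):
    """Fraction of test keyphrases that also occur in the training set."""
    seen_vocab = {kp.lower() for kp in training_keyphrases}
    test = [kp.lower() for kp in test_keyphrases]
    if not test:
        return 0.0
    return sum(kp in seen_vocab for kp in test) / len(test)
```

A high `seen_ratio` on a benchmark would suggest that a model can score well by recalling training keyphrases, which is the kind of overestimation described above.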


Quickstart
Installation
git clone https://github.com/bytedance/DynamicCoT.git
cd DynamicCoT
pip3 install -e ".[torch,metrics,deepspeed]"
# use transformers==4.52.1 for InternVL3 and transformers==4.49.0 for other models
pip3 install transformers==4.49.0
pip3 install vllm==0.7.3
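Because the required transformers version depends on the model family, a quick check before evaluation can catch a mismatched environment. The sketch below uses only the standard library; the version pins come from the installation notes above, while the family labels and function names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the installation notes; the dictionary keys are
# hypothetical labels chosen for this sketch.
EXPECTED_TRANSFORMERS = {"internvl3": "4.52.1", "default": "4.49.0"}


def expected_transformers_version(model_family):
    """Return the pinned transformers version for a model family."""
    return EXPECTED_TRANSFORMERS.get(
        model_family.lower(), EXPECTED_TRANSFORMERS["default"]
    )


def check_transformers(model_family):
    """Return (ok, message) describing whether the installed version matches."""
    wanted = expected_transformers_version(model_family)
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False, f"transformers is not installed; expected {wanted}"
    if installed != wanted:
        return False, f"transformers {installed} found; expected {wanted}"
    return True, f"transformers {installed} matches"


if __name__ == "__main__":
    ok, message = check_transformers("internvl3")
    print(message)
```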
Test model
# for InternVL3, source_txt in data/mmkp_source/
bash eval_internvl.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
# for other models
bash eval_full_sft.sh {/path/to/model} {/path/to/source_txt} --template {template} --dataset {test_dataset}
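When evaluating several checkpoints, the two commands above can be assembled programmatically. This is a minimal sketch, assuming only the positional arguments and the `--template` and `--dataset` flags shown above; `build_eval_command` and its `is_internvl3` switch are hypothetical helpers, and the placeholder values must be replaced with real paths and names.

```python
import shlex


def build_eval_command(model_path, source_txt, template, dataset, is_internvl3=False):
    """Build the argument list for the evaluation script.

    InternVL3 checkpoints use eval_internvl.sh; all other models use
    eval_full_sft.sh, mirroring the two commands in this README.
    """
    script = "eval_internvl.sh" if is_internvl3 else "eval_full_sft.sh"
    return [
        "bash", script, model_path, source_txt,
        "--template", template,
        "--dataset", dataset,
    ]


if __name__ == "__main__":
    # Placeholder values only; substitute real paths, template, and dataset.
    cmd = build_eval_command(
        "/path/to/model", "/path/to/source_txt", "template", "test_dataset"
    )
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell-quoting problems with paths that contain spaces.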
License
DynamicCoT is derived from Qwen2.5-VL-3B-Instruct, which is subject to the Qwen RESEARCH LICENSE AGREEMENT. We retain ownership of all intellectual property rights in and to any derivative works and modifications that we made.
Acknowledgement
This project would not be possible without several great open-source codebases. We list some notable examples below.
BibTeX
If this work is helpful for your research, please consider citing the following BibTeX entry.
@article{ma2025dynamiccot,
title={Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models},
author={Ma, Qihang and Li, Shengyu and Tang, Jie and Yang, Dingkang and Chen, Shaodong and Zhang, Yingyi and Feng, Chao and Ran, Jiao},
journal={},
year={2025}
}