SoybeanMilk committed (verified)
Commit 83c44bd · 1 Parent(s): 795f920

Update README.md

Files changed (1): README.md (+223 -222)
README.md CHANGED

---
base_model:
- moonshotai/Kimi-VL-A3B-Instruct
- moonshotai/Kimi-VL-A3B-Thinking-2506
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---

> [!Note]
> This is an improved version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking). Please consider using this updated model instead of the previous version.

> [!Note]
> Please visit our tech blog for the recommended inference recipe for this model: [Kimi-VL-A3B-Thinking-2506: A Quick Navigation](https://huggingface.co/blog/moonshotai/kimi-vl-a3b-thinking-2506)

<div align="center">
  <img width="80%" src="figures/logo.png">
</div>

<div align="center">
  <a href="https://arxiv.org/abs/2504.07491">
    <b>📄 Tech Report</b>
  </a> &nbsp;|&nbsp;
  <a href="https://github.com/MoonshotAI/Kimi-VL">
    <b>📄 GitHub</b>
  </a> &nbsp;|&nbsp;
  <a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking">💬 <b>Chat Web</b></a>
</div>

## 1. Introduction

This is an updated version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking), with the following improved abilities:

- **It Thinks Smarter while Consuming Fewer Tokens**: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.3), and 64.0 on MMMU (+2.1), while requiring, on average, a 20% shorter thinking length.
- **It Sees Clearer with Thinking**: Unlike the previous version, which specialized in thinking tasks, the 2506 version also achieves the same or even better ability on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), MMVet (78.4), surpassing or matching the abilities of our non-thinking model ([Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
- **It Extends to Video Scenarios**: The new 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2) while retaining good general video understanding (71.9 on Video-MME, matching [Kimi-VL-A3B-Instruct](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct)).
- **It Extends to Higher Resolution**: The new 2506 version supports 3.2 million total pixels in a single image, 4X the previous version (a rough pixel-budget sketch follows this list). This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, 52.5 on OSWorld-G (full set with refusal).
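
For intuition about the new pixel budget, here is a small illustrative helper of ours (not part of the model card; the bundled processor is expected to handle resizing on its own) that downscales an image so its total pixel count stays within 3.2 million:

```python
# Illustrative only (not from the model card): downscale an image so its total
# pixel count stays within the 2506 version's 3.2-million-pixel budget.
from PIL import Image

MAX_PIXELS = 3_200_000  # 3.2 million total pixels, 4X the previous version


def fit_to_pixel_budget(image: Image.Image, max_pixels: int = MAX_PIXELS) -> Image.Image:
    """Return the image unchanged if it fits the budget, otherwise downscale it proportionally."""
    w, h = image.size
    if w * h <= max_pixels:
        return image
    scale = (max_pixels / (w * h)) ** 0.5
    return image.resize((max(1, int(w * scale)), max(1, int(h * scale))))
```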

## 2. Performance

Comparison with efficient models and two previous versions of Kimi-VL (results of GPT-4o are shown in *italics* for reference):

<div align="center">

| Benchmark (Metric) | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
|----------------------------|--------|---------------|---------------|----------------------|----------------------|---------------------------|
| **General Multimodal** | | | | | | |
| MMBench-EN-v1.1 (Acc) | *83.1* | 83.2 | 74.6 | 82.9 | 76.0 | **84.4** |
| RealWorldQA (Acc) | *75.4* | 68.5 | 59.1 | 68.1 | 64.0 | **70.0** |
| OCRBench (Acc) | *815* | 864 | 702 | 864 | 864 | **869** |
| MMStar (Acc) | *64.7* | 63.0 | 56.1 | 61.7 | 64.2 | **70.4** |
| MMVet (Acc) | *69.1* | 67.1 | 64.9 | 66.7 | 69.5 | **78.1** |
| **Reasoning** | | | | | | |
| MMMU (val, Pass@1) | *69.1* | 58.6 | 59.6 | 57.0 | 61.7 | **64.0** |
| MMMU-Pro (Pass@1) | *51.7* | 38.1 | 32.1 | 36.0 | 43.2 | **46.3** |
| **Math** | | | | | | |
| MATH-Vision (Pass@1) | *30.4* | 25.0 | 32.1 | 21.7 | 36.8 | **56.9** |
| MathVista_MINI (Pass@1) | *63.8* | 68.0 | 56.1 | 68.6 | 71.7 | **80.1** |
| **Video** | | | | | | |
| VideoMMMU (Pass@1) | *61.2* | 47.4 | 57.0 | 52.1 | 55.5 | **65.2** |
| MMVU (Pass@1) | *67.4* | 50.1 | 57.0 | 52.7 | 53.0 | **57.5** |
| Video-MME (w/ sub.) | *77.2* | 71.6 | 62.1 | **72.7** | 66.0 | 71.9 |
| **Agent Grounding** | | | | | | |
| ScreenSpot-Pro (Acc) | *0.8* | 29.0 | — | 35.4 | — | **52.8** |
| ScreenSpot-V2 (Acc) | *18.1* | 84.2 | — | **92.8** | — | 91.4 |
| OSWorld-G (Acc) | - | *31.5* | — | 41.6 | — | **52.5** |
| **Long Document** | | | | | | |
| MMLongBench-DOC (Acc) | *42.8* | 29.6 | 21.3 | 35.1 | 32.5 | **42.1** |
</div>

Comparison with 30B-70B open-source models:

<div align="center">

| Benchmark (Metric) | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Qwen2.5-VL-72B | Gemma3-27B-IT |
|----------------------------|---------------------------|----------------|----------------|---------------|
| **General Multimodal** | | | | |
| MMBench-EN-v1.1 (Acc) | 84.4 | - | 88.3 | 78.9 |
| RealWorldQA (Acc) | 70.0 | - | 75.7 | 62.5 |
| OCRBench (Acc) | 869 | - | 885 | 753 |
| MMStar (Acc) | 70.4 | 69.5 | 70.8 | 63.1 |
| MMVet (Acc) | 78.1 | - | 74.0 | 71.0 |
| **Reasoning** | | | | |
| MMMU (val, Pass@1) | 64.0 | 70.0 | 70.2 | 64.9 |
| MMMU-Pro (Pass@1) | 46.3 | 49.5 | 51.1 | - |
| MATH-Vision (Pass@1) | 56.9 | 38.4 | 38.1 | 35.4 |
| MathVista_MINI (Pass@1) | 80.1 | 74.7 | 74.8 | 59.8 |
| **Video** | | | | |
| VideoMMMU (Pass@1) | 65.2 | - | 60.2 | 61.8 |
| MMVU (Pass@1) | 57.5 | - | 62.9 | 61.3 |
| Video-MME (w/ sub.) | 71.9 | 70.5/77.9 | 73.3/79.1 | - |
| **Agent Grounding** | | | | |
| ScreenSpot-Pro (Acc) | 52.8 | 39.4 | 43.6 | - |
| ScreenSpot-V2 (Acc) | 91.4 | - | - | - |
| OSWorld-G (Acc) | 52.5 | 46.5 | - | - |
| **Long Document** | | | | |
| MMLongBench-DOC (Acc) | 42.1 | - | 38.8 | - |
</div>

## 3. Usage

### 3.1. Inference with vLLM (recommended)

As a long-decode model that generates up to 32K tokens, we recommend using [vLLM](https://github.com/vllm-project/vllm/tree/main/vllm) for inference; it already supports the Kimi-VL series.

```shell
MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation
```

> [!Note]
> It is important to explicitly install flash-attn to avoid CUDA out-of-memory errors.

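As a quick sanity check (our suggestion, not part of the official instructions), you can confirm that flash-attn imports correctly before launching vLLM:

```python
# Optional sanity check (not part of the official instructions): verify that
# flash-attn is importable in the same environment that will run vLLM.
import flash_attn

print(flash_attn.__version__)
```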

```python
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256},
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# The thinking model decodes long reasoning traces, so allow up to 32K new tokens.
sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


import requests
from PIL import Image

def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split a generation into its thinking trace and final summary."""
    if bot in text and eot not in text:
        # Thinking was cut off before the closing tag: everything after the tag is thinking.
        return text[text.index(bot) + len(bot):].strip(), ""
    if eot in text:
        return text[text.index(bot) + len(bot):text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# The image placeholder in the chat template is empty; the actual image is passed
# to vLLM via multi_modal_data below.
messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

thinking, summary = extract_thinking_and_summary(generated_text)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
```
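
Alternatively, for service-style usage you may prefer vLLM's OpenAI-compatible server over offline inference. The sketch below is ours and is not part of the official recipe; the serve command, local URL, and dummy API key are assumptions about a default local deployment.

```python
# Minimal sketch (ours): query a vLLM OpenAI-compatible server started with, e.g.:
#   vllm serve moonshotai/Kimi-VL-A3B-Thinking-2506 --trust-remote-code --max-model-len 131072
# The base_url and api_key below assume a local server with authentication disabled.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Thinking-2506",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"}},
                {"type": "text", "text": "What kind of cat is this? Answer with one word."},
            ],
        }
    ],
    max_tokens=32768,
    temperature=0.8,
)

# The reply still contains the ◁think▷ ... ◁/think▷ trace; it can be split with
# the extract_thinking_and_summary helper defined above.
print(completion.choices[0].message.content)
```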

### 3.2. Inference with 🤗 Hugging Face Transformers

Below we show how to run inference with the 🤗 Transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split a generation into its thinking trace and final summary."""
    if bot in text and eot not in text:
        # Thinking was cut off before the closing tag: everything after the tag is thinking.
        return text[text.index(bot) + len(bot):].strip(), ""
    if eot in text:
        return text[text.index(bot) + len(bot):text.index(eot)].strip(), text[text.index(eot) + len(eot):].strip()
    return "", text

OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_paths = [url]
# The demo image is a remote URL, so fetch it with requests; for local files,
# Image.open(path) works directly.
images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
# do_sample=True is required for the temperature setting to take effect.
generated_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.8)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

thinking, summary = extract_thinking_and_summary(response)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
```
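
Since thinking traces can run to tens of thousands of tokens, it can help to stream output as it is generated. The sketch below is our addition (not from the original card); it reuses `model`, `processor`, and `inputs` from the example above and assumes the processor exposes its tokenizer as `processor.tokenizer`.

```python
# Optional (our addition): stream tokens to stdout while the model is thinking,
# using transformers' TextStreamer. Reuses `model`, `processor`, and `inputs`
# from the example above.
from transformers import TextStreamer

streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=32768, do_sample=True, temperature=0.8, streamer=streamer)
```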

## 4. Citation

```bibtex
@misc{kimiteam2025kimivltechnicalreport,
  title={{Kimi-VL} Technical Report},
  author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
  year={2025},
  eprint={2504.07491},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07491},
}
```