Improve Kimi-VL-A3B-Instruct model card

#6
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +22 -97
README.md CHANGED
@@ -1,34 +1,31 @@
  ---
- license: mit
  base_model:
  - moonshotai/Moonlight-16B-A3B
  pipeline_tag: image-text-to-text
  ---

-
  <div align="center">
- <img width="30%" src="figures/logo.png">
  </div>

  ## Introduction

- We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**, all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).
-
- Kimi-VL demonstrates strong performance across challenging domains:
- as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models.
- Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, multi-image understanding, and more.

- In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.
-
- Kimi-VL also advances the Pareto frontier of multimodal models in processing long contexts and perceiving clearly: equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc; its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common visual inputs and general tasks.
-
- Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models.

  ## Architecture

- The model adopts an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.

  <div align="center">
  <img width="90%" src="figures/arch.png">
@@ -36,98 +33,31 @@ The model adopts an MoE language model, a native-resolution visual encoder (Moon

  ## Model Variants

- 🤗 For general multimodal perception and understanding, OCR, long video and long document, video perception, and agent uses, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider using `Kimi-VL-A3B-Thinking`.
-
- <div align="center">
-
  | **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
  | :------------: | :------------: | :------------: | :------------: | :------------: |
  | Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
  | Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |

- </div>
-
- ## Performance
-
- As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.).

- A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B):

  <div align="center">
  <img width="100%" src="figures/instruct_perf.png">
  </div>

- Full comparison (GPT-4o included for reference):
-
- <div align="center">
-
- | Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
- |--------------------------------|--------|-------------|---------------|--------------------|---------------|--------------|-------------|
- | **Architecture** | - | - | Dense | Dense | Dense | MoE | MoE |
- | **# Act. Params (LLM+VT)** | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
- | **# Total Params** | - | - | 8B | 11B | 12B | 28B | 16B |
- | | | | | | | | |
- | **College-level** | | | | | | | |
- | MMMU-Val (Pass@1) | *69.1* | **60.0** | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
- | VideoMMMU (Pass@1) | *61.2* | - | 47.4 | 41.8 | **57.2** | 44.4 | 52.6 |
- | MMVU-Val (Pass@1) | *67.4* | **61.6** | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
- | | | | | | | | |
- | **General** | | | | | | | |
- | MMBench-EN-v1.1 (Acc) | *83.1* | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | **83.1** |
- | MMStar (Acc) | *64.7* | 54.8 | **63.9** | 49.8 | 56.1 | 55.5 | 61.3 |
- | MMVet (Pass@1) | *69.1* | 66.9 | **67.1** | 57.6 | 64.9 | 60.0 | 66.7 |
- | RealWorldQA (Acc) | *75.4* | 67.1 | **68.5** | 63.3 | 59.1 | 68.4 | 68.1 |
- | AI2D (Acc) | *84.6* | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | **84.9** |
- | | | | | | | | |
- | **Multi-image** | | | | | | | |
- | BLINK (Acc) | *68.0* | 53.6 | 56.4 | 39.8 | 50.3 | - | **57.3** |
- | | | | | | | | |
- | **Math** | | | | | | | |
- | MathVista (Pass@1) | *63.8* | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | **68.7** |
- | MathVision (Pass@1) | *30.4* | - | 25.1 | 13.6 | **32.1** | 17.3 | 21.4 |
- | | | | | | | | |
- | **OCR** | | | | | | | |
- | InfoVQA (Acc) | *80.7* | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | **83.2** |
- | OCRBench (Acc) | *815* | 785 | 864 | 753 | 702 | 811 | **867** |
- | | | | | | | | |
- | **OS Agent** | | | | | | | |
- | ScreenSpot-V2 (Acc) | *18.1* | 6.9 | 84.2 | - | - | - | **92.8** |
- | ScreenSpot-Pro (Acc) | *0.8* | - | 29.0 | - | - | - | **34.5** |
- | OSWorld (Pass@1) | *5.03* | - | 2.5 | - | - | - | **8.22** |
- | WindowsAgentArena (Pass@1) | *9.4* | 2.7 | 3.4 | - | - | - | **10.4** |
- | | | | | | | | |
- | **Long Document** | | | | | | | |
- | MMLongBench-Doc (Acc) | *42.8* | 29.0 | 29.6 | 13.8 | 21.3 | - | **35.1** |
- | | | | | | | | |
- | **Long Video** | | | | | | | |
- | Video-MME (w/o sub.) | *71.9* | 64.8 | 65.1 | 46.0 | 58.2 | - | **67.8** |
- | Video-MME (w sub.) | *77.2* | 68.9 | 71.6 | 49.5 | 62.1 | - | **72.6** |
- | MLVU-MCQ (Acc) | *64.6* | 48.1 | 70.2 | 44.4 | 52.3 | - | **74.2** |
- | LongVideoBench (val) | *66.7* | 58.2 | 56.0 | 45.5 | 51.5 | - | **64.5** |
- | | | | | | | | |
- | **Video Perception** | | | | | | | |
- | EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | **78.5** |
- | VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | **37.4** |
- | TOMATO | *37.7* | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | **31.7** |
-
- </div>
-
- ### Inference with 🤗 Hugging Face Transformers

- We introduce how to use our model at the inference stage with the Transformers library. It is recommended to use python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.

  ```python
  from PIL import Image
  from transformers import AutoModelForCausalLM, AutoProcessor

  model_path = "moonshotai/Kimi-VL-A3B-Instruct"
- model = AutoModelForCausalLM.from_pretrained(
-     model_path,
-     torch_dtype="auto",
-     device_map="auto",
-     trust_remote_code=True,
- )
  processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

  image_path = "./figures/demo.png"
@@ -138,30 +68,25 @@ messages = [
  text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
  inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
  generated_ids = model.generate(**inputs, max_new_tokens=512)
- generated_ids_trimmed = [
-     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
- ]
- response = processor.batch_decode(
-     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )[0]
  print(response)
  ```

- ### Inference with vLLM

- We have submitted a Merge Request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM. You are welcome to deploy Kimi-VL using the branch corresponding to the vLLM MR until the MR is merged.

  ## Citation

  ```
  @misc{kimiteam2025kimivltechnicalreport,
  title={{Kimi-VL} Technical Report},
- author={Kimi Team and Angang Du and Bohong Yin and Bowei Xing and Bowen Qu and Bowen Wang and Cheng Chen and Chenlin Zhang and Chenzhuang Du and Chu Wei and Congcong Wang and Dehao Zhang and Dikang Du and Dongliang Wang and Enming Yuan and Enzhe Lu and Fang Li and Flood Sung and Guangda Wei and Guokun Lai and Han Zhu and Hao Ding and Hao Hu and Hao Yang and Hao Zhang and Haoning Wu and Haotian Yao and Haoyu Lu and Heng Wang and Hongcheng Gao and Huabin Zheng and Jiaming Li and Jianlin Su and Jianzhou Wang and Jiaqi Deng and Jiezhong Qiu and Jin Xie and Jinhong Wang and Jingyuan Liu and Junjie Yan and Kun Ouyang and Liang Chen and Lin Sui and Longhui Yu and Mengfan Dong and Mengnan Dong and Nuo Xu and Pengyu Cheng and Qizheng Gu and Runjie Zhou and Shaowei Liu and Sihan Cao and Tao Yu and Tianhui Song and Tongtong Bai and Wei Song and Weiran He and Weixiao Huang and Weixin Xu and Xiaokun Yuan and Xingcheng Yao and Xingzhe Wu and Xinxing Zu and Xinyu Zhou and Xinyuan Wang and Y. Charles and Yan Zhong and Yang Li and Yangyang Hu and Yanru Chen and Yejie Wang and Yibo Liu and Yibo Miao and Yidao Qin and Yimin Chen and Yiping Bao and Yiqin Wang and Yongsheng Kang and Yuanxin Liu and Yulun Du and Yuxin Wu and Yuzhi Wang and Yuzi Yan and Zaida Zhou and Zhaowei Li and Zhejun Jiang and Zheng Zhang and Zhilin Yang and Zhiqi Huang and Zihao Huang and Zijia Zhao and Ziwei Chen},
  year={2025},
  eprint={2504.07491},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07491},
  }
- ```
-

  ---
  base_model:
  - moonshotai/Moonlight-16B-A3B
+ license: mit
  pipeline_tag: image-text-to-text
+ library_name: transformers
  ---

+ <div align="center">
+ <a href="Kimi-VL.pdf">KIMI-VL TECHNICAL REPORT</a>
+ </div>

  <div align="center">
+ <a href="https://arxiv.org/abs/2504.07491"><img src="figures/logo.png" height="16" width="16" style="vertical-align:middle"><b> Tech Report</b></a> |
+ <a href="https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct"><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="16" width="16" style="vertical-align:middle"><b> Hugging Face</b>
+ </a> |
+ <a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/">💬 Chat Web</a>
  </div>

  ## Introduction

+ We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**, all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, multi-image understanding, and more. It effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, even surpassing GPT-4o in several specialized domains. Kimi-VL also excels at processing long contexts and high-resolution images, achieving impressive results on benchmarks such as LongVideoBench, MMLongBench-Doc, InfoVQA, and ScreenSpot-Pro. We also introduce **Kimi-VL-Thinking**, a variant fine-tuned for long-horizon reasoning, which achieves high scores on MMMU, MathVision, and MathVista with a compact 2.8B activated LLM parameter footprint.

  ## Architecture

+ Kimi-VL uses a Mixture-of-Experts (MoE) language model, a native-resolution visual encoder (MoonViT), and an MLP projector.
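
The figure below illustrates this layout. As a rough orientation only, the data flow can be sketched as follows; all module names and sizes in the snippet are hypothetical toy placeholders, not the actual Kimi-VL implementation: MoonViT encodes the image at native resolution into patch features, the MLP projector maps them into the language model's embedding space, and the MoE decoder consumes the concatenated visual and text token sequence.

```python
# Conceptual toy sketch only; all names and dimensions are hypothetical placeholders.
import torch
import torch.nn as nn

vision_dim, text_dim, vocab_size = 64, 128, 1000

# Stand-in for the MLP projector between the vision encoder and the language model.
projector = nn.Sequential(
    nn.Linear(vision_dim, text_dim),
    nn.GELU(),
    nn.Linear(text_dim, text_dim),
)
embed = nn.Embedding(vocab_size, text_dim)  # stand-in for the LM token embedding

patch_features = torch.randn(1, 196, vision_dim)   # stand-in for MoonViT output (B, N_patches, D_vis)
input_ids = torch.randint(0, vocab_size, (1, 16))  # stand-in text tokens

vision_tokens = projector(patch_features)                  # (1, 196, D_text)
text_tokens = embed(input_ids)                             # (1, 16, D_text)
sequence = torch.cat([vision_tokens, text_tokens], dim=1)  # merged sequence fed to the MoE decoder
print(sequence.shape)  # torch.Size([1, 212, 128])
```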

  <div align="center">
  <img width="90%" src="figures/arch.png">

  ## Model Variants

  | **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
  | :------------: | :------------: | :------------: | :------------: | :------------: |
  | Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
  | Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |

+ For general multimodal tasks, OCR, long video/document understanding, video perception, and agent applications, we recommend `Kimi-VL-A3B-Instruct`. For advanced text and multimodal reasoning (e.g., math), use `Kimi-VL-A3B-Thinking`. You can also chat with the `Kimi-VL-A3B-Thinking` model on our [Hugging Face demo](https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/).

+ ## Performance

+ Kimi-VL robustly handles diverse tasks (perception, math, college-level problems, OCR, agent interaction) across various input formats (image, multi-image, video, long document). See the Tech Report for detailed benchmark results. A brief comparison with other models:

  <div align="center">
  <img width="100%" src="figures/instruct_perf.png">
  </div>

+ ## Example Usage (Transformers)

  ```python
  from PIL import Image
  from transformers import AutoModelForCausalLM, AutoProcessor

  model_path = "moonshotai/Kimi-VL-A3B-Instruct"
+ model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
  processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

  image_path = "./figures/demo.png"

  text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
  inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
  generated_ids = model.generate(**inputs, max_new_tokens=512)
+ generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
+ response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
  print(response)
  ```
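
Multi-image understanding is one of the capabilities highlighted above. The snippet below is only a sketch of how the same pipeline could be extended to a multi-image prompt: the message schema (a content list with one `image` entry per image followed by a `text` entry) and passing a list of images to the processor are assumptions modeled on the single-image example, and the image paths are placeholders.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Placeholder paths; replace with your own images.
image_paths = ["./figures/demo1.png", "./figures/demo2.png"]
images = [Image.open(p) for p in image_paths]

# Assumed message schema: one "image" entry per image, then the text instruction.
messages = [
    {
        "role": "user",
        "content": [
            *[{"type": "image", "image": p} for p in image_paths],
            {"type": "text", "text": "Compare the two images and describe their differences."},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
```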

+ ## Deployment (vLLM)

+ We have submitted pull request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM for easier deployment.
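
Until that pull request is merged, the snippet below is a minimal sketch of offline inference with vLLM's Python API, assuming vLLM is installed from the PR branch. The generic `multi_modal_data` input path and the plain prompt string are assumptions; the exact prompt and image-token format for Kimi-VL is defined by the PR and may change before it lands.

```python
# Sketch only: assumes vLLM built from the branch behind PR #16387 (Kimi-VL support
# is not yet in a released vLLM version), so details may differ once it is merged.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.2, max_tokens=512)

image = Image.open("./figures/demo.png")
# The prompt below is a placeholder; Kimi-VL expects its own chat template and
# image placeholder tokens, which the PR branch defines.
outputs = llm.generate(
    {
        "prompt": "What is the dome building in the picture?",
        "multi_modal_data": {"image": image},
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```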

  ## Citation

  ```
  @misc{kimiteam2025kimivltechnicalreport,
  title={{Kimi-VL} Technical Report},
+ author={Kimi Team and ...},
  year={2025},
  eprint={2504.07491},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07491},
  }
+ ```