Improve Kimi-VL-A3B-Instruct model card
#6 by nielsr (HF Staff) · opened

README.md CHANGED
@@ -1,34 +1,31 @@
---
-license: mit
base_model:
- moonshotai/Moonlight-16B-A3B
pipeline_tag: image-text-to-text
---

<div align="center">
-<
</div>

## Introduction

-We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).
-Kimi-VL demonstrates strong performance across challenging domains:
-as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models.
-Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, multi-image understanding, and more.

-In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.
-Kimi-VL also advances the Pareto frontiers of multimodal models in processing long contexts and perceiving clearly: equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc; its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common visual inputs and general tasks.
-Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models.

## Architecture

<div align="center">
<img width="90%" src="figures/arch.png">

@@ -36,98 +33,31 @@ The model adopts an MoE language model, a native-resolution visual encoder (Moon

## Model Variants

-🤗 For general multimodal perception and understanding, OCR, long video and long document understanding, video perception, and agent uses, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider using `Kimi-VL-A3B-Thinking`.

-<div align="center">

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |

-## Performance

-As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.).

<div align="center">
<img width="100%" src="figures/instruct_perf.png">
</div>

-Full comparison (GPT-4o included for reference):

-<div align="center">

-| Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
-|--------------------------------|--------|-------------|---------------|--------------------|---------------|--------------|-------------|
-| **Architecture** | - | - | Dense | Dense | Dense | MoE | MoE |
-| **# Act. Params (LLM+VT)** | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
-| **# Total Params** | - | - | 8B | 11B | 12B | 28B | 16B |
-| | | | | | | | |
-| **College-level** | | | | | | | |
-| MMMU-Val (Pass@1) | *69.1* | **60.0** | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
-| VideoMMMU (Pass@1) | *61.2* | - | 47.4 | 41.8 | **57.2** | 44.4 | 52.6 |
-| MMVU-Val (Pass@1) | *67.4* | **61.6** | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
-| | | | | | | | |
-| **General** | | | | | | | |
-| MMBench-EN-v1.1 (Acc) | *83.1* | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | **83.1** |
-| MMStar (Acc) | *64.7* | 54.8 | **63.9** | 49.8 | 56.1 | 55.5 | 61.3 |
-| MMVet (Pass@1) | *69.1* | 66.9 | **67.1** | 57.6 | 64.9 | 60.0 | 66.7 |
-| RealWorldQA (Acc) | *75.4* | 67.1 | **68.5** | 63.3 | 59.1 | 68.4 | 68.1 |
-| AI2D (Acc) | *84.6* | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | **84.9** |
-| | | | | | | | |
-| **Multi-image** | | | | | | | |
-| BLINK (Acc) | *68.0* | 53.6 | 56.4 | 39.8 | 50.3 | - | **57.3** |
-| | | | | | | | |
-| **Math** | | | | | | | |
-| MathVista (Pass@1) | *63.8* | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | **68.7** |
-| MathVision (Pass@1) | *30.4* | - | 25.1 | 13.6 | **32.1** | 17.3 | 21.4 |
-| | | | | | | | |
-| **OCR** | | | | | | | |
-| InfoVQA (Acc) | *80.7* | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | **83.2** |
-| OCRBench (Acc) | *815* | 785 | 864 | 753 | 702 | 811 | **867** |
-| | | | | | | | |
-| **OS Agent** | | | | | | | |
-| ScreenSpot-V2 (Acc) | *18.1* | 6.9 | 84.2 | - | - | - | **92.8** |
-| ScreenSpot-Pro (Acc) | *0.8* | - | 29.0 | - | - | - | **34.5** |
-| OSWorld (Pass@1) | *5.03* | - | 2.5 | - | - | - | **8.22** |
-| WindowsAgentArena (Pass@1) | *9.4* | 2.7 | 3.4 | - | - | - | **10.4** |
-| | | | | | | | |
-| **Long Document** | | | | | | | |
-| MMLongBench-Doc (Acc) | *42.8* | 29.0 | 29.6 | 13.8 | 21.3 | - | **35.1** |
-| | | | | | | | |
-| **Long Video** | | | | | | | |
-| Video-MME (w/o sub.) | *71.9* | 64.8 | 65.1 | 46.0 | 58.2 | - | **67.8** |
-| Video-MME (w sub.) | *77.2* | 68.9 | 71.6 | 49.5 | 62.1 | - | **72.6** |
-| MLVU-MCQ (Acc) | *64.6* | 48.1 | 70.2 | 44.4 | 52.3 | - | **74.2** |
-| LongVideoBench (val) | *66.7* | 58.2 | 56.0 | 45.5 | 51.5 | - | **64.5** |
-| | | | | | | | |
-| **Video Perception** | | | | | | | |
-| EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | **78.5** |
-| VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | **37.4** |
-| TOMATO | *37.7* | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | **31.7** |

-</div>

-
### Inference with ๐ค Hugging Face Transformers
|
117 |
|
118 |
-
|
119 |
|
120 |
```python
|
121 |
from PIL import Image
|
122 |
from transformers import AutoModelForCausalLM, AutoProcessor
|
123 |
|
124 |
model_path = "moonshotai/Kimi-VL-A3B-Instruct"
|
125 |
-
model = AutoModelForCausalLM.from_pretrained(
|
126 |
-
model_path,
|
127 |
-
torch_dtype="auto",
|
128 |
-
device_map="auto",
|
129 |
-
trust_remote_code=True,
|
130 |
-
)
|
131 |
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
|
132 |
|
133 |
image_path = "./figures/demo.png"
|
@@ -138,30 +68,25 @@ messages = [
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
-generated_ids_trimmed = [
-    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
-]
-response = processor.batch_decode(
-    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
-)[0]
print(response)
```

-We have submitted a Merge Request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM

## Citation

```
@misc{kimiteam2025kimivltechnicalreport,
  title={{Kimi-VL} Technical Report},
-  author={Kimi Team and
  year={2025},
  eprint={2504.07491},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07491},
}
```

---
base_model:
- moonshotai/Moonlight-16B-A3B
+license: mit
pipeline_tag: image-text-to-text
+library_name: transformers
---

+<div align="center">
+<a href="Kimi-VL.pdf">KIMI-VL TECHNICAL REPORT</a>
+</div>

<div align="center">
+<a href="https://arxiv.org/abs/2504.07491"><img src="figures/logo.png" height="16" width="16" style="vertical-align:middle"><b> Tech Report</b></a> |
+<a href="https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct"><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="16" width="16" style="vertical-align:middle"><b> HuggingFace</b>
+</a> |
+<a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/">💬 Chat Web</a>
</div>

## Introduction

+We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, multi-image understanding, and more. It effectively competes with cutting-edge efficient VLMs like GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, even surpassing GPT-4o in several specialized domains. Kimi-VL also excels in processing long contexts and high-resolution images, achieving impressive results on benchmarks like LongVideoBench, MMLongBench-Doc, InfoVQA, and ScreenSpot-Pro. We also introduce **Kimi-VL-Thinking**, a variant fine-tuned for long-horizon reasoning, achieving high scores on MMMU, MathVision, and MathVista with a compact 2.8B activated LLM parameter footprint.

## Architecture

+Kimi-VL uses a Mixture-of-Experts (MoE) language model, a native-resolution visual encoder (MoonViT), and an MLP projector.

<div align="center">
<img width="90%" src="figures/arch.png">

## Model Variants

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |

+For general multimodal tasks, OCR, long video/document understanding, video perception, and agent applications, we recommend `Kimi-VL-A3B-Instruct`. For advanced text and multimodal reasoning (e.g., math), use `Kimi-VL-A3B-Thinking`. You can also chat with the `Kimi-VL-A3B-Thinking` model on our [Hugging Face Demo](https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/).

+## Performance

+Kimi-VL robustly handles diverse tasks (perception, math, college-level problems, OCR, agent interaction) across various input formats (image, multi-image, video, long-document). See the Tech Report for detailed benchmark results. A brief comparison with other models:

<div align="center">
<img width="100%" src="figures/instruct_perf.png">
</div>


+## Example Usage (Transformers)

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "./figures/demo.png"
# ... (unchanged lines not shown in this diff hunk)
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
+generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
+response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
```
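
The diff above does not show the unchanged lines that open the demo image and build `messages`. A minimal sketch of what that omitted part could look like is given below; the message format (an `"image"` entry followed by a `"text"` entry) and the prompt string are assumptions for illustration, not taken from the diff:

```python
from PIL import Image

# Hypothetical reconstruction of the omitted lines (illustration only, not the card's exact code).
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},  # assumed image-entry format
            {"type": "text", "text": "What is shown in this image?"},  # placeholder prompt
        ],
    }
]
```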

+## Deployment (vLLM)

+We have submitted a Merge Request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM for easier deployment.
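
Until that PR is merged, vLLM support is not available out of the box. As a hedged sketch of the expected workflow once support lands, the model would typically be served through vLLM's OpenAI-compatible server and queried as below; the endpoint, image URL, prompt, and exact server flags are placeholders and may differ from what the merged PR requires:

```python
# Sketch only: assumes the vLLM PR above has been merged and a server was started, e.g. with
#   vllm serve moonshotai/Kimi-VL-A3B-Instruct --trust-remote-code
# Queries then go through the OpenAI-compatible chat completions API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint
completion = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.png"}},  # placeholder image
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(completion.choices[0].message.content)
```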

## Citation

```
@misc{kimiteam2025kimivltechnicalreport,
  title={{Kimi-VL} Technical Report},
+  author={Kimi Team and ...},
  year={2025},
  eprint={2504.07491},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.07491},
}
```