nielsr (HF Staff) committed
Commit 84052e3 · verified · Parent: 54d622b

Add library_name metadata and link to paper


This PR links the model card to the corresponding paper, the [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491), and adds `library_name: transformers` metadata so that a "how to use" button appears at the top right of the model page.
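For reference, here is a minimal sketch of how the updated front matter becomes machine-readable once the change is merged. It assumes the `huggingface_hub` client and the repo id `moonshotai/Kimi-VL-A3B-Thinking`, which is not named in this diff:

```python
# Minimal sketch: reading the card metadata touched by this PR.
# Assumption: the card lives at moonshotai/Kimi-VL-A3B-Thinking (repo id not stated in the diff).
from huggingface_hub import ModelCard

card = ModelCard.load("moonshotai/Kimi-VL-A3B-Thinking")
print(card.data.library_name)  # expected "transformers"; this field drives the "how to use" button
print(card.data.pipeline_tag)  # expected "image-text-to-text"
print(card.data.license)       # expected "mit"
print(card.data.base_model)    # expected ["moonshotai/Kimi-VL-A3B-Instruct"]
```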

Files changed (1):
1. README.md (+6, -9)
README.md CHANGED

@@ -1,17 +1,15 @@
 ---
-license: mit
 base_model:
 - moonshotai/Kimi-VL-A3B-Instruct
+license: mit
 pipeline_tag: image-text-to-text
+library_name: transformers
 ---

-
-
 <div align="center">
   <img width="30%" src="figures/logo.png">
 </div>

-
 ## Introduction

 We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).
@@ -26,6 +24,8 @@ Kimi-VL also advances the pareto frontiers of multimodal models in processing lo

 Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models.

+More information can be found in our technical report: [Kimi-VL Technical Report](https://arxiv.org/abs/2504.07491).
+
 ## Architecture

 The model adopts an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.
@@ -59,7 +59,6 @@ Full comparison on MMMU, MathVision, and MathVista-mini:

 <div align="center">

-
 | Benchmark (Metric) | GPT-4o | GPT-4o-mini | Qwen2.5-VL-72B | Qwen2.5-VL-7B | Gemma-3-27B | Gemma-3-12B | o1-1217 | QVQ-72B | Kimi-k1.5 | Kimi-VL-Thinking-A3B |
 |---------------------------------|--------|-------------|----------------|---------------|-------------|-------------|---------|----------|-----------|----------------------|
 | *Thinking Model?* | | | | | | | ✅ | ✅ | ✅ | ✅ |
@@ -67,7 +66,6 @@ Full comparison on MMMU, MathVision, and MathVista-mini:
 | MathVista (mini) (Pass@1) | 63.8 | 56.7 | 74.8 | 68.2 | 62.3 | 56.4 | 71.0 | 71.4 | 74.9 | 71.3 |
 | MMMU (val) (Pass@1) | 69.1 | 60.0 | 74.8 | 58.6 | 64.8 | 59.6 | 77.3 | 70.3 | 70.0 | 61.7 |

-
 </div>

 ### Inference with 🤗 Hugging Face Transformers
@@ -113,7 +111,7 @@ print(response)

 We have submitted a Merge Request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM. You are welcome to deploy Kimi-VL using the branch corresponding to the vLLM MR until the MR is merged.

-## Citation
+## 8. Citation

 ```
 @misc{kimiteam2025kimivltechnicalreport,
@@ -125,5 +123,4 @@ We have submitted a Merge Request [#16387](https://github.com/vllm-project/vllm/
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2504.07491},
 }
-```
-
+```
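With `library_name: transformers` in place, the "how to use" button surfaces a Transformers snippet along the lines of the README's own "Inference with 🤗 Hugging Face Transformers" section, which this diff leaves unchanged and collapsed. A minimal sketch of that flow is below; the repo id and image path are placeholders, and the README's own snippet remains the authoritative version:

```python
# Minimal sketch of Transformers inference for a Kimi-VL checkpoint.
# Assumptions: the repo id and image path are placeholders; see the README's
# "Inference with 🤗 Hugging Face Transformers" section for the exact snippet.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Thinking"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "figures/demo.png"  # placeholder image
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image_path},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
response = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```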