|
--- |
|
license: mit |
|
base_model: |
|
- moonshotai/Moonlight-16B-A3B |
|
pipeline_tag: image-text-to-text |
|
--- |
|
|
|
|
|
|
|
<div align="center"> |
|
<img width="30%" src="figures/logo.png"> |
|
</div> |
|
|
|
|
|
## Introduction |
|
|
|
We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B). |
|
|
|
Kimi-VL demonstrates strong performance across challenging domains: |
|
as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models.
|
Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.
|
|
|
In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains. |
|
|
|
Kimi-VL also advances the Pareto frontier of multimodal models in long-context processing and high-resolution perception: equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc; its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common visual inputs and general tasks.
|
|
|
Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models. |
|
|
|
## Architecture |
|
|
|
The model combines an MoE language model, a native-resolution vision encoder (MoonViT), and an MLP projector, as illustrated in the following image.
|
|
|
<div align="center"> |
|
<img width="90%" src="figures/arch.png"> |
|
</div> |
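
For intuition, the data flow can be sketched in PyTorch-style pseudocode. Everything below (class name, dimensions, and the way visual tokens are merged into the text sequence) is an illustrative assumption drawn from the figure, not the actual implementation:

```python
import torch
import torch.nn as nn


class KimiVLSketch(nn.Module):
    """Illustrative sketch of the three-component composition shown above.
    The vision encoder and MoE decoder are passed in as stand-in modules;
    all names and shapes here are assumptions, not the real implementation."""

    def __init__(self, vision_encoder: nn.Module, moe_decoder: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # MoonViT: native-resolution images -> visual tokens
        self.projector = nn.Sequential(       # MLP projector: visual tokens -> LLM embedding space
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.moe_decoder = moe_decoder        # MoE language model (~2.8B activated parameters)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(pixel_values)          # (B, N_vis, vision_dim)
        visual_embeds = self.projector(visual_tokens)              # (B, N_vis, text_dim)
        sequence = torch.cat([visual_embeds, text_embeds], dim=1)  # merge visual tokens with text embeddings
        return self.moe_decoder(sequence)                          # decoder output (e.g., next-token logits)
```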
|
|
|
## Model Variants |
|
|
|
🤗 For general multimodal perception and understanding, OCR, long-video and long-document understanding, video perception, and agent tasks, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider `Kimi-VL-A3B-Thinking`.
|
|
|
<div align="center"> |
|
|
|
| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |
|
|
|
</div> |
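
As a quick reference, the recommendation above can be expressed as a simple selection rule. This is a minimal sketch; the task labels are informal shorthand, not an official API, and either repository ID can be plugged into the loading code in the inference section below:

```python
# Informal mapping from use case to the recommended checkpoint.
RECOMMENDED_CHECKPOINT = {
    "perception": "moonshotai/Kimi-VL-A3B-Instruct",  # general understanding, OCR, long video/document, agent
    "reasoning": "moonshotai/Kimi-VL-A3B-Thinking",   # advanced text and multimodal reasoning, e.g., math
}

model_path = RECOMMENDED_CHECKPOINT["perception"]  # pass this to AutoModelForCausalLM/AutoProcessor below
```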
|
|
|
## Performance |
|
|
|
As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single image, multi-image, video, long document, etc.).
|
|
|
|
|
A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B): |
|
|
|
<div align="center"> |
|
<img width="100%" src="figures/instruct_perf.png"> |
|
</div> |
|
|
|
Full comparison (GPT-4o included for reference): |
|
|
|
<div align="center"> |
|
|
|
| Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
|--------------------------------|--------|-------------|---------------|--------------------|---------------|--------------|-------------|
| **Architecture** | - | - | Dense | Dense | Dense | MoE | MoE |
| **# Act. Params (LLM+VT)** | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| **# Total Params** | - | - | 8B | 11B | 12B | 28B | 16B |
| | | | | | | | |
| **College-level** | | | | | | | |
| MMMU-Val (Pass@1) | *69.1* | **60.0** | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| VideoMMMU (Pass@1) | *61.2* | - | 47.4 | 41.8 | **57.2** | 44.4 | 52.6 |
| MMVU-Val (Pass@1) | *67.4* | **61.6** | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| | | | | | | | |
| **General** | | | | | | | |
| MMBench-EN-v1.1 (Acc) | *83.1* | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | **83.1** |
| MMStar (Acc) | *64.7* | 54.8 | **63.9** | 49.8 | 56.1 | 55.5 | 61.3 |
| MMVet (Pass@1) | *69.1* | 66.9 | **67.1** | 57.6 | 64.9 | 60.0 | 66.7 |
| RealWorldQA (Acc) | *75.4* | 67.1 | **68.5** | 63.3 | 59.1 | 68.4 | 68.1 |
| AI2D (Acc) | *84.6* | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | **84.9** |
| | | | | | | | |
| **Multi-image** | | | | | | | |
| BLINK (Acc) | *68.0* | 53.6 | 56.4 | 39.8 | 50.3 | - | **57.3** |
| | | | | | | | |
| **Math** | | | | | | | |
| MathVista (Pass@1) | *63.8* | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | **68.7** |
| MathVision (Pass@1) | *30.4* | - | 25.1 | 13.6 | **32.1** | 17.3 | 21.4 |
| | | | | | | | |
| **OCR** | | | | | | | |
| InfoVQA (Acc) | *80.7* | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | **83.2** |
| OCRBench (Acc) | *815* | 785 | 864 | 753 | 702 | 811 | **867** |
| | | | | | | | |
| **OS Agent** | | | | | | | |
| ScreenSpot-V2 (Acc) | *18.1* | 6.9 | 84.2 | - | - | - | **92.8** |
| ScreenSpot-Pro (Acc) | *0.8* | - | 29.0 | - | - | - | **34.5** |
| OSWorld (Pass@1) | *5.03* | - | 2.5 | - | - | - | **8.22** |
| WindowsAgentArena (Pass@1) | *9.4* | 2.7 | 3.4 | - | - | - | **10.4** |
| | | | | | | | |
| **Long Document** | | | | | | | |
| MMLongBench-Doc (Acc) | *42.8* | 29.0 | 29.6 | 13.8 | 21.3 | - | **35.1** |
| | | | | | | | |
| **Long Video** | | | | | | | |
| Video-MME (w/o sub.) | *71.9* | 64.8 | 65.1 | 46.0 | 58.2 | - | **67.8** |
| Video-MME (w sub.) | *77.2* | 68.9 | 71.6 | 49.5 | 62.1 | - | **72.6** |
| MLVU-MCQ (Acc) | *64.6* | 48.1 | 70.2 | 44.4 | 52.3 | - | **74.2** |
| LongVideoBench (val) | *66.7* | 58.2 | 56.0 | 45.5 | 51.5 | - | **64.5** |
| | | | | | | | |
| **Video Perception** | | | | | | | |
| EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | **78.5** |
| VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | **37.4** |
| TOMATO | *37.7* | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | **31.7** |
|
|
|
</div> |
|
|
|
### Inference with 🤗 Hugging Face Transformers |
|
|
|
This section shows how to run inference with the 🤗 Hugging Face Transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
|
|
|
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a single-turn conversation containing one image and one text query.
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]

# Render the chat template, preprocess image and text, and generate a response.
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens so only the newly generated answer is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
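
The same pipeline can run the long-thinking variant by swapping the checkpoint. The following is a minimal sketch, assuming only that a larger `max_new_tokens` budget is useful to leave room for the longer chain-of-thought; the 2048-token limit is an arbitrary illustrative choice:

```python
# Load the Thinking checkpoint; loading and preprocessing mirror the Instruct example above.
thinking_path = "moonshotai/Kimi-VL-A3B-Thinking"
thinking_model = AutoModelForCausalLM.from_pretrained(
    thinking_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
thinking_processor = AutoProcessor.from_pretrained(thinking_path, trust_remote_code=True)

# Rebuild the prompt with this processor, since the chat template may differ between variants.
text = thinking_processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = thinking_processor(images=image, text=text, return_tensors="pt", padding=True).to(thinking_model.device)

# Allow extra tokens for the model's long chain-of-thought before its final answer.
generated_ids = thinking_model.generate(**inputs, max_new_tokens=2048)
```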
|
|
|
### Inference with vLLM
|
|
|
Coming soon! |
|
|