---
license: mit
base_model:
- moonshotai/Moonlight-16B-A3B
pipeline_tag: image-text-to-text
---
## Introduction
We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**, all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).
Kimi-VL demonstrates strong performance across challenging domains:
as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models.
Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.
In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.
Kimi-VL also advances the Pareto frontier of multimodal models in processing long contexts and perceiving clearly. Equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common visual inputs and general tasks.
Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models.
## Architecture
The model combines an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector that bridges the two.
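To make the composition concrete, here is a minimal, purely illustrative sketch of how these three components fit together; the class names, dimensions, and the dense Transformer stand-in for the MoE decoder are hypothetical and do not reflect the actual Kimi-VL implementation.
```python
# Purely illustrative sketch of the three-component layout described above.
# All names, dimensions, and the dense Transformer stand-in for the MoE
# decoder are hypothetical; this is NOT the actual Kimi-VL implementation.
import torch
import torch.nn as nn


class KimiVLSketch(nn.Module):
    def __init__(self, patch_dim=588, vision_dim=1024, lm_dim=2048, vocab_size=32000):
        super().__init__()
        # Stand-in for MoonViT: encodes native-resolution image patches into features.
        self.vision_encoder = nn.Linear(patch_dim, vision_dim)
        # MLP projector: maps vision features into the language model's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        # Stand-in for the MoE language decoder (a small dense stack here).
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=16, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image_patches, input_ids):
        # image_patches: (batch, num_patches, patch_dim); input_ids: (batch, num_text_tokens)
        vision_tokens = self.projector(self.vision_encoder(image_patches))
        text_tokens = self.text_embed(input_ids)
        # Projected vision tokens sit alongside text tokens and are decoded jointly
        # (causal masking omitted for brevity).
        sequence = torch.cat([vision_tokens, text_tokens], dim=1)
        return self.decoder(sequence)
```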
## Model Variants
🤗 For general multimodal perception and understanding, OCR, long video and long document understanding, video perception, and agent use cases, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider using `Kimi-VL-A3B-Thinking`.
| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |
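If you prefer to fetch the weights ahead of time (e.g., for offline use), a small sketch using `huggingface_hub` is shown below; the repository IDs come from the table above, while the local directory path is only an example.
```python
# Optional: pre-download a variant's weights from the Hugging Face Hub.
# The repo_id values match the table above; local_dir is just an example path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="moonshotai/Kimi-VL-A3B-Instruct",  # or "moonshotai/Kimi-VL-A3B-Thinking"
    local_dir="./Kimi-VL-A3B-Instruct",
)
print(f"Model files downloaded to: {local_path}")
```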
## Performance
As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.).
Below is a full comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B), with GPT-4o included for reference:
| Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
|--------------------------------|--------|-------------|---------------|--------------------|---------------|--------------|-------------|
| **Architecture** | - | - | Dense | Dense | Dense | MoE | MoE |
| **# Act. Params (LLM+VT)** | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| **# Total Params** | - | - | 8B | 11B | 12B | 28B | 16B |
| | | | | | | | |
| **College-level** | | | | | | | |
| MMMU-Val (Pass@1) | *69.1* | **60.0** | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| VideoMMMU (Pass@1) | *61.2* | - | 47.4 | 41.8 | **57.2** | 44.4 | 52.6 |
| MMVU-Val (Pass@1) | *67.4* | **61.6** | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| | | | | | | | |
| **General** | | | | | | | |
| MMBench-EN-v1.1 (Acc) | *83.1* | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | **83.1** |
| MMStar (Acc) | *64.7* | 54.8 | **63.9** | 49.8 | 56.1 | 55.5 | 61.3 |
| MMVet (Pass@1) | *69.1* | 66.9 | **67.1** | 57.6 | 64.9 | 60.0 | 66.7 |
| RealWorldQA (Acc) | *75.4* | 67.1 | **68.5** | 63.3 | 59.1 | 68.4 | 68.1 |
| AI2D (Acc) | *84.6* | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | **84.9** |
| | | | | | | | |
| **Multi-image** | | | | | | | |
| BLINK (Acc) | *68.0* | 53.6 | 56.4 | 39.8 | 50.3 | - | **57.3** |
| | | | | | | | |
| **Math** | | | | | | | |
| MathVista (Pass@1) | *63.8* | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | **68.7** |
| MathVision (Pass@1) | *30.4* | - | 25.1 | 13.6 | **32.1** | 17.3 | 21.4 |
| | | | | | | | |
| **OCR** | | | | | | | |
| InfoVQA (Acc) | *80.7* | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | **83.2** |
| OCRBench (Acc) | *815* | 785 | 864 | 753 | 702 | 811 | **867** |
| | | | | | | | |
| **OS Agent** | | | | | | | |
| ScreenSpot-V2 (Acc) | *18.1* | 6.9 | 84.2 | - | - | - | **92.8** |
| ScreenSpot-Pro (Acc) | *0.8* | - | 29.0 | - | - | - | **34.5** |
| OSWorld (Pass@1) | *5.03* | - | 2.5 | - | - | - | **8.22** |
| WindowsAgentArena (Pass@1) | *9.4* | 2.7 | 3.4 | - | - | - | **10.4** |
| | | | | | | | |
| **Long Document** | | | | | | | |
| MMLongBench-Doc (Acc) | *42.8* | 29.0 | 29.6 | 13.8 | 21.3 | - | **35.1** |
| | | | | | | | |
| **Long Video** | | | | | | | |
| Video-MME (w/o sub.) | *71.9* | 64.8 | 65.1 | 46.0 | 58.2 | - | **67.8** |
| Video-MME (w sub.) | *77.2* | 68.9 | 71.6 | 49.5 | 62.1 | - | **72.6** |
| MLVU-MCQ (Acc) | *64.6* | 48.1 | 70.2 | 44.4 | 52.3 | - | **74.2** |
| LongVideoBench (val) | *66.7* | 58.2 | 56.0 | 45.5 | 51.5 | - | **64.5** |
| | | | | | | | |
| **Video Perception** | | | | | | | |
| EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | **78.5** |
| VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | **37.4** |
| TOMATO | *37.7* | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | **31.7** |
### Inference with 🤗 Hugging Face Transformers
Here we show how to run inference with our model using the `transformers` library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and processor; trust_remote_code is required for the custom
# Kimi-VL model and processor classes.
model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a single-image chat message; the image is referenced in the message
# (for the chat template) and also passed to the processor below.
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "What is the dome building in the picture? Think step by step."},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate and strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
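The same pipeline should carry over to `Kimi-VL-A3B-Thinking` by swapping the checkpoint name; this is our assumption based on the shared architecture and processor, and since reasoning traces tend to be long, the sketch below leaves a larger generation budget for the chain of thought.
```python
# Assumed drop-in swap to the Thinking variant: same loading and preprocessing
# as above, with only the checkpoint name and generation budget changed.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Solve the problem shown in the image step by step."},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# Allow a larger token budget so the model has room for its reasoning trace.
generated_ids = model.generate(**inputs, max_new_tokens=2048)
response = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)[0]
print(response)
```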
### Inference with vLLM
Coming soon!