---
license: mit
base_model:
- moonshotai/Moonlight-16B-A3B
pipeline_tag: image-text-to-text
---

## Introduction
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities—all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B).
Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, it excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models. It also exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.
In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.
Kimi-VL also advances the Pareto frontier of multimodal models in long-context processing and fine-grained perception. Equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common visual inputs and general tasks.
Building on this foundation, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal thinking models.
## Architecture
The model combines an MoE language model, a native-resolution vision encoder (MoonViT), and an MLP projector, as illustrated in the figure below.
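To make the data flow concrete, here is a minimal, hypothetical sketch of the projector stage (class name and dimensions are illustrative, not the actual implementation, which ships with the checkpoint via `trust_remote_code`): MoonViT emits native-resolution image tokens, the MLP projector maps them into the language model's embedding space, and the MoE decoder consumes them alongside text embeddings.

```python
import torch
import torch.nn as nn

class MLPProjectorSketch(nn.Module):
    """Illustrative two-layer MLP mapping vision features to the LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, text_dim: int = 2048):  # hypothetical dims
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (num_image_tokens, vision_dim) produced by MoonViT
        # for an image at its native resolution (no fixed tiling).
        return self.net(vision_tokens)

# Toy usage: 256 image tokens from a hypothetical MoonViT output.
projected = MLPProjectorSketch()(torch.randn(256, 1152))
print(projected.shape)  # torch.Size([256, 2048]); these embeddings feed the MoE decoder
```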

## Model Variants
🤗 For general multimodal perception and understanding, OCR, long video and long document understanding, video perception, and agent use cases, we recommend Kimi-VL-A3B-Instruct for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider using Kimi-VL-A3B-Thinking. A minimal loading example follows the table below.
Model | #Total Params | #Activated Params | Context Length | Download Link |
---|---|---|---|---|
Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |
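Both variants are loaded the same way; only the repository id changes. The snippet below is a minimal sketch using the same `transformers` calls as the inference example further down.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Pick the variant: "Instruct" for general perception/agent use,
# "Thinking" for long chain-of-thought reasoning.
model_path = "moonshotai/Kimi-VL-A3B-Thinking"  # or "moonshotai/Kimi-VL-A3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",      # use the dtype stored in the checkpoint config
    device_map="auto",       # place layers on available devices automatically
    trust_remote_code=True,  # the modeling code ships with the checkpoint
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
```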
## Performance
As an efficient model, Kimi-VL robustly handles diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.).
A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B):

Full comparison (GPT-4o included for reference):
Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
---|---|---|---|---|---|---|---|
Architecture | - | - | Dense | Dense | Dense | MoE | MoE |
# Act. Params (LLM+VT) | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
# Total Params | - | - | 8B | 11B | 12B | 28B | 16B |
College-level | |||||||
MMMU-Val (Pass@1) | 69.1 | 60.0 | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
VideoMMMU (Pass@1) | 61.2 | - | 47.4 | 41.8 | 57.2 | 44.4 | 52.6 |
MMVU-Val (Pass@1) | 67.4 | 61.6 | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
General | |||||||
MMBench-EN-v1.1 (Acc) | 83.1 | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | 83.1 |
MMStar (Acc) | 64.7 | 54.8 | 63.9 | 49.8 | 56.1 | 55.5 | 61.3 |
MMVet (Pass@1) | 69.1 | 66.9 | 67.1 | 57.6 | 64.9 | 60.0 | 66.7 |
RealWorldQA (Acc) | 75.4 | 67.1 | 68.5 | 63.3 | 59.1 | 68.4 | 68.1 |
AI2D (Acc) | 84.6 | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | 84.9 |
Multi-image | |||||||
BLINK (Acc) | 68.0 | 53.6 | 56.4 | 39.8 | 50.3 | - | 57.3 |
Math | |||||||
MathVista (Pass@1) | 63.8 | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | 68.7 |
MathVision (Pass@1) | 30.4 | - | 25.1 | 13.6 | 32.1 | 17.3 | 21.4 |
OCR | |||||||
InfoVQA (Acc) | 80.7 | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | 83.2 |
OCRBench (Acc) | 815 | 785 | 864 | 753 | 702 | 811 | 867 |
OS Agent | |||||||
ScreenSpot-V2 (Acc) | 18.1 | 6.9 | 84.2 | - | - | - | 92.8 |
ScreenSpot-Pro (Acc) | 0.8 | - | 29.0 | - | - | - | 34.5 |
OSWorld (Pass@1) | 5.03 | - | 2.5 | - | - | - | 8.22 |
WindowsAgentArena (Pass@1) | 9.4 | 2.7 | 3.4 | - | - | - | 10.4 |
Long Document | |||||||
MMLongBench-Doc (Acc) | 42.8 | 29.0 | 29.6 | 13.8 | 21.3 | - | 35.1 |
Long Video | |||||||
Video-MME (w/o sub.) | 71.9 | 64.8 | 65.1 | 46.0 | 58.2 | - | 67.8 |
Video-MME (w sub.) | 77.2 | 68.9 | 71.6 | 49.5 | 62.1 | - | 72.6 |
MLVU-MCQ (Acc) | 64.6 | 48.1 | 70.2 | 44.4 | 52.3 | - | 74.2 |
LongVideoBench (val) | 66.7 | 58.2 | 56.0 | 45.5 | 51.5 | - | 64.5 |
Video Perception | |||||||
EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | 78.5 |
VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | 37.4 |
TOMATO | 37.7 | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | 31.7 |
## Inference with 🤗 Hugging Face Transformers
This section shows how to run inference with the 🤗 transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",      # use the dtype stored in the checkpoint config
    device_map="auto",       # place layers on available devices automatically
    trust_remote_code=True,  # Kimi-VL ships its modeling code with the checkpoint
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a single-image chat message; the image is referenced in the message
# and also passed to the processor below.
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate, then strip the prompt tokens so only the model's answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
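To stream tokens to stdout as they are generated instead of waiting for the full completion, `transformers`' `TextStreamer` can be attached to `generate`. The sketch below reuses `model`, `processor`, and `inputs` from the example above and assumes the processor exposes its tokenizer as `processor.tokenizer` (common for multimodal processors, but verify for this checkpoint).

```python
from transformers import TextStreamer

# Print decoded text incrementally as generation proceeds.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=512)
```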
## Inference with vLLM
Coming soon!