---
license: mit
base_model:
  - moonshotai/Moonlight-16B-A3B
pipeline_tag: image-text-to-text
---

Introduction

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities—all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B).

Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.

In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.

Kimi-VL also advances the Pareto frontier of multimodal models in long-context processing and fine-grained perception: equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost on common visual inputs and general tasks.

Building on this foundation, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal thinking models.

Architecture

The model consists of an MoE language model, a native-resolution vision encoder (MoonViT), and an MLP projector, as illustrated in the following image.
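
To make the data flow concrete, here is a minimal, illustrative sketch of how the three components fit together. All module names, dimensions, and shapes below are placeholder assumptions for illustration only (MoonViT outputs are stood in by random tensors, and the MoE decoder itself is omitted); this is not the actual Kimi-VL implementation.

```python
# Illustrative sketch only: toy shapes and module names are assumptions,
# not the real Kimi-VL code.
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Maps vision-encoder features into the language model's embedding space."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Hypothetical sizes; the real hidden dimensions differ.
vision_dim, text_dim = 1024, 2048
image_tokens = torch.randn(1, 196, vision_dim)  # stand-in for MoonViT patch features
text_embeds = torch.randn(1, 32, text_dim)      # stand-in for embedded text tokens

projector = MLPProjector(vision_dim, text_dim)
vision_embeds = projector(image_tokens)

# The fused sequence (image tokens followed by text tokens) is what the MoE decoder consumes.
decoder_inputs = torch.cat([vision_embeds, text_embeds], dim=1)
print(decoder_inputs.shape)  # torch.Size([1, 228, 2048])
```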

Model Variants

🤗 For general multimodal perception and understanding, OCR, long video and long document understanding, video perception, and agent use cases, we recommend Kimi-VL-A3B-Instruct for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider Kimi-VL-A3B-Thinking. A loading sketch for switching between the two variants follows the table below.

| Model | #Total Params | #Activated Params | Context Length | Download Link |
|---|---|---|---|---|
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | 🤗 Hugging Face |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | 🤗 Hugging Face |
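
Both variants load the same way, so switching is just a matter of the repository id. A minimal sketch: the Thinking repository id below is assumed to mirror the Instruct naming pattern; see the full Transformers example later in this card for end-to-end usage.

```python
# Minimal variant-selection sketch; see the full inference example below.
from transformers import AutoModelForCausalLM, AutoProcessor

# "moonshotai/Kimi-VL-A3B-Thinking" is assumed to mirror the Instruct repo naming.
model_path = "moonshotai/Kimi-VL-A3B-Instruct"  # or "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
```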

Performance

As an efficient model, Kimi-VL robustly handles diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single image, multi-image, video, long document, etc.).

A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B):

Full comparison (GPT-4o included for reference):

| Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
|---|---|---|---|---|---|---|---|
| Architecture | - | - | Dense | Dense | Dense | MoE | MoE |
| # Act. Params (LLM+VT) | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| # Total Params | - | - | 8B | 11B | 12B | 28B | 16B |
| **College-level** | | | | | | | |
| MMMU-Val (Pass@1) | 69.1 | 60.0 | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| VideoMMMU (Pass@1) | 61.2 | - | 47.4 | 41.8 | 57.2 | 44.4 | 52.6 |
| MMVU-Val (Pass@1) | 67.4 | 61.6 | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| **General** | | | | | | | |
| MMBench-EN-v1.1 (Acc) | 83.1 | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | 83.1 |
| MMStar (Acc) | 64.7 | 54.8 | 63.9 | 49.8 | 56.1 | 55.5 | 61.3 |
| MMVet (Pass@1) | 69.1 | 66.9 | 67.1 | 57.6 | 64.9 | 60.0 | 66.7 |
| RealWorldQA (Acc) | 75.4 | 67.1 | 68.5 | 63.3 | 59.1 | 68.4 | 68.1 |
| AI2D (Acc) | 84.6 | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | 84.9 |
| **Multi-image** | | | | | | | |
| BLINK (Acc) | 68.0 | 53.6 | 56.4 | 39.8 | 50.3 | - | 57.3 |
| **Math** | | | | | | | |
| MathVista (Pass@1) | 63.8 | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | 68.7 |
| MathVision (Pass@1) | 30.4 | - | 25.1 | 13.6 | 32.1 | 17.3 | 21.4 |
| **OCR** | | | | | | | |
| InfoVQA (Acc) | 80.7 | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | 83.2 |
| OCRBench (Acc) | 815 | 785 | 864 | 753 | 702 | 811 | 867 |
| **OS Agent** | | | | | | | |
| ScreenSpot-V2 (Acc) | 18.1 | 6.9 | 84.2 | - | - | - | 92.8 |
| ScreenSpot-Pro (Acc) | 0.8 | - | 29.0 | - | - | - | 34.5 |
| OSWorld (Pass@1) | 5.03 | - | 2.5 | - | - | - | 8.22 |
| WindowsAgentArena (Pass@1) | 9.4 | 2.7 | 3.4 | - | - | - | 10.4 |
| **Long Document** | | | | | | | |
| MMLongBench-Doc (Acc) | 42.8 | 29.0 | 29.6 | 13.8 | 21.3 | - | 35.1 |
| **Long Video** | | | | | | | |
| Video-MME (w/o sub.) | 71.9 | 64.8 | 65.1 | 46.0 | 58.2 | - | 67.8 |
| Video-MME (w/ sub.) | 77.2 | 68.9 | 71.6 | 49.5 | 62.1 | - | 72.6 |
| MLVU-MCQ (Acc) | 64.6 | 48.1 | 70.2 | 44.4 | 52.3 | - | 74.2 |
| LongVideoBench (val) | 66.7 | 58.2 | 56.0 | 45.5 | 51.5 | - | 64.5 |
| **Video Perception** | | | | | | | |
| EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | 78.5 |
| VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | 37.4 |
| TOMATO | 37.7 | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | 31.7 |

Inference with 🤗 Hugging Face Transformers

Below we show how to run the model at inference time with the 🤗 Transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and its processor (trust_remote_code is required for the custom Kimi-VL code).
model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a single-image chat message and render it with the chat template.
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

# Generate, then strip the prompt tokens so only the newly generated response is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
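
For multi-image inputs, the same message format extends naturally. The sketch below reuses the `model` and `processor` objects from the snippet above; the file paths are hypothetical placeholders, and passing a list of images to the processor follows common Transformers processor conventions rather than anything Kimi-VL-specific documented here.

```python
# A multi-image sketch reusing `model` and `processor` from the example above.
# The image paths are hypothetical placeholders.
from PIL import Image

image_paths = ["./figures/demo1.png", "./figures/demo2.png"]
images = [Image.open(p) for p in image_paths]

# One user turn containing several image entries followed by the text prompt.
messages = [
    {
        "role": "user",
        "content": [
            *[{"type": "image", "image": p} for p in image_paths],
            {"type": "text", "text": "Compare the two images and describe what changed."},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```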

Inference with vLLM

Coming soon!