|
--- |
|
license: mit |
|
base_model: |
|
- moonshotai/Moonlight-16B-A3B |
|
pipeline_tag: image-text-to-text |
|
--- |
|
|
|
|
|
|
|
<div align="center"> |
|
<img width="30%" src="figures/logo.png"> |
|
</div> |
|
|
|
|
|
## Introduction |
|
|
|
We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B). |
|
|
|
Kimi-VL demonstrates strong performance across challenging domains: |
|
as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models.
|
Furthermore, it exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.
|
|
|
In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains. |
|
|
|
Kimi-VL also advances the Pareto frontier of multimodal models in long-context processing and high-resolution perception: equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc; its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common visual inputs and general tasks.
|
|
|
Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models. |
|
|
|
## Architecture |
|
|
|
The model combines an MoE language model, a native-resolution vision encoder (MoonViT), and an MLP projector, as illustrated in the following image.
|
|
|
<div align="center"> |
|
<img width="90%" src="figures/arch.png"> |
|
</div> |
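
For intuition, the data flow can be sketched in PyTorch-style pseudocode. Everything below (class name, dimensions, and the way visual tokens are merged into the text sequence) is an illustrative assumption drawn from the figure, not the actual implementation:

```python
import torch
import torch.nn as nn


class KimiVLSketch(nn.Module):
    """Illustrative sketch of the three-component composition shown above.
    The vision encoder and MoE decoder are passed in as stand-in modules;
    all names and shapes here are assumptions, not the real implementation."""

    def __init__(self, vision_encoder: nn.Module, moe_decoder: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder  # MoonViT: native-resolution images -> visual tokens
        self.projector = nn.Sequential(       # MLP projector: visual tokens -> LLM embedding space
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.moe_decoder = moe_decoder        # MoE language model (~2.8B activated parameters)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.vision_encoder(pixel_values)          # (B, N_vis, vision_dim)
        visual_embeds = self.projector(visual_tokens)              # (B, N_vis, text_dim)
        sequence = torch.cat([visual_embeds, text_embeds], dim=1)  # merge visual tokens with text embeddings
        return self.moe_decoder(sequence)                          # decoder output (e.g., next-token logits)
```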
|
|
|
## Model Variants |
|
|
|
🤗 For general multimodal perception and understanding, OCR, long-video and long-document understanding, video perception, and agent tasks, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider `Kimi-VL-A3B-Thinking`.
|
|
|
<div align="center"> |
|
|
|
| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |
|
|
|
</div> |
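
As a quick reference, the recommendation above can be expressed as a simple selection rule. This is a minimal sketch; the task labels are informal shorthand, not an official API, and either repository ID can be plugged into the loading code in the inference section below:

```python
# Informal mapping from use case to the recommended checkpoint.
RECOMMENDED_CHECKPOINT = {
    "perception": "moonshotai/Kimi-VL-A3B-Instruct",  # general understanding, OCR, long video/document, agent
    "reasoning": "moonshotai/Kimi-VL-A3B-Thinking",   # advanced text and multimodal reasoning, e.g., math
}

model_path = RECOMMENDED_CHECKPOINT["perception"]  # pass this to AutoModelForCausalLM/AutoProcessor below
```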
|
|
|
## Performance |
|
|
|
As an efficient model, Kimi-VL can robustly handle diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single image, multi-image, video, long document, etc.).
|
|
|
|
|
A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B): |
|
|
|
<div align="center"> |
|
<img width="100%" src="figures/instruct_perf.png"> |
|
</div> |
|
|
|
Full comparison (GPT-4o included for reference): |
|
|
|
<div align="center"> |
|
|
|
| Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
|--------------------------------|--------|-------------|---------------|--------------------|---------------|--------------|-------------|
| **Architecture** | - | - | Dense | Dense | Dense | MoE | MoE |
| **# Act. Params (LLM+VT)** | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
| **# Total Params** | - | - | 8B | 11B | 12B | 28B | 16B |
| | | | | | | | |
| **College-level** | | | | | | | |
| MMMU-Val (Pass@1) | *69.1* | **60.0** | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
| VideoMMMU (Pass@1) | *61.2* | - | 47.4 | 41.8 | **57.2** | 44.4 | 52.6 |
| MMVU-Val (Pass@1) | *67.4* | **61.6** | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
| | | | | | | | |
| **General** | | | | | | | |
| MMBench-EN-v1.1 (Acc) | *83.1* | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | **83.1** |
| MMStar (Acc) | *64.7* | 54.8 | **63.9** | 49.8 | 56.1 | 55.5 | 61.3 |
| MMVet (Pass@1) | *69.1* | 66.9 | **67.1** | 57.6 | 64.9 | 60.0 | 66.7 |
| RealWorldQA (Acc) | *75.4* | 67.1 | **68.5** | 63.3 | 59.1 | 68.4 | 68.1 |
| AI2D (Acc) | *84.6* | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | **84.9** |
| | | | | | | | |
| **Multi-image** | | | | | | | |
| BLINK (Acc) | *68.0* | 53.6 | 56.4 | 39.8 | 50.3 | - | **57.3** |
| | | | | | | | |
| **Math** | | | | | | | |
| MathVista (Pass@1) | *63.8* | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | **68.7** |
| MathVision (Pass@1) | *30.4* | - | 25.1 | 13.6 | **32.1** | 17.3 | 21.4 |
| | | | | | | | |
| **OCR** | | | | | | | |
| InfoVQA (Acc) | *80.7* | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | **83.2** |
| OCRBench (Acc) | *815* | 785 | 864 | 753 | 702 | 811 | **867** |
| | | | | | | | |
| **OS Agent** | | | | | | | |
| ScreenSpot-V2 (Acc) | *18.1* | 6.9 | 84.2 | - | - | - | **92.8** |
| ScreenSpot-Pro (Acc) | *0.8* | - | 29.0 | - | - | - | **34.5** |
| OSWorld (Pass@1) | *5.03* | - | 2.5 | - | - | - | **8.22** |
| WindowsAgentArena (Pass@1) | *9.4* | 2.7 | 3.4 | - | - | - | **10.4** |
| | | | | | | | |
| **Long Document** | | | | | | | |
| MMLongBench-Doc (Acc) | *42.8* | 29.0 | 29.6 | 13.8 | 21.3 | - | **35.1** |
| | | | | | | | |
| **Long Video** | | | | | | | |
| Video-MME (w/o sub.) | *71.9* | 64.8 | 65.1 | 46.0 | 58.2 | - | **67.8** |
| Video-MME (w sub.) | *77.2* | 68.9 | 71.6 | 49.5 | 62.1 | - | **72.6** |
| MLVU-MCQ (Acc) | *64.6* | 48.1 | 70.2 | 44.4 | 52.3 | - | **74.2** |
| LongVideoBench (val) | *66.7* | 58.2 | 56.0 | 45.5 | 51.5 | - | **64.5** |
| | | | | | | | |
| **Video Perception** | | | | | | | |
| EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | **78.5** |
| VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | **37.4** |
| TOMATO | *37.7* | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | **31.7** |
|
|
|
</div> |
|
|
|
### Inference with 🤗 Hugging Face Transformers |
|
|
|
This section shows how to run inference with the 🤗 Hugging Face Transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
|
|
|
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a single-turn conversation containing one image and one text query.
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]

# Render the chat template, preprocess image and text, and generate a response.
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens so only the newly generated answer is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
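
The same pipeline can run the long-thinking variant by swapping the checkpoint. The following is a minimal sketch, assuming only that a larger `max_new_tokens` budget is useful to leave room for the longer chain-of-thought; the 2048-token limit is an arbitrary illustrative choice:

```python
# Load the Thinking checkpoint; loading and preprocessing mirror the Instruct example above.
thinking_path = "moonshotai/Kimi-VL-A3B-Thinking"
thinking_model = AutoModelForCausalLM.from_pretrained(
    thinking_path, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
thinking_processor = AutoProcessor.from_pretrained(thinking_path, trust_remote_code=True)

# Rebuild the prompt with this processor, since the chat template may differ between variants.
text = thinking_processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = thinking_processor(images=image, text=text, return_tensors="pt", padding=True).to(thinking_model.device)

# Allow extra tokens for the model's long chain-of-thought before its final answer.
generated_ids = thinking_model.generate(**inputs, max_new_tokens=2048)
```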
|
|
|
### Inference with vLLM
|
|
|
Coming soon! |
|
|