---
base_model:
- moonshotai/Moonlight-16B-A3B
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---

<div align="center">
<a href="Kimi-VL.pdf">KIMI-VL TECHNICAL REPORT</a>
</div>

<div align="center">
<a href="https://arxiv.org/abs/2504.07491"><img src="figures/logo.png" height="16" width="16" style="vertical-align:middle"><b> Tech Report</b></a> |
<a href="https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct"><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="16" width="16" style="vertical-align:middle"><b> HuggingFace</b></a> |
<a href="https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/">💬 Chat Web</a>
</div>

## Introduction

We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**—all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, multi-image understanding, and more. It effectively competes with cutting-edge efficient VLMs like GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, even surpassing GPT-4o in several specialized domains. Kimi-VL also excels in processing long contexts and high-resolution images, achieving impressive results on benchmarks like LongVideoBench, MMLongBench-Doc, InfoVQA, and ScreenSpot-Pro. We also introduce **Kimi-VL-Thinking**, a variant fine-tuned for long-horizon reasoning, achieving high scores on MMMU, MathVision, and MathVista with a compact 2.8B activated LLM parameter footprint.

## Architecture

Kimi-VL uses a Mixture-of-Experts (MoE) language model, a native-resolution visual encoder (MoonViT), and an MLP projector.

<div align="center">
<img width="90%" src="figures/arch.png">
</div>

## Model Variants

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
| :------------: | :------------: | :------------: | :------------: | :------------: |
| Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
| Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |

For general multimodal tasks, OCR, long video/document understanding, video perception, and agent applications, we recommend `Kimi-VL-A3B-Instruct`. For advanced text and multimodal reasoning (e.g., math), use `Kimi-VL-A3B-Thinking`. You can also chat with the `Kimi-VL-A3B-Thinking` model on our [Hugging Face demo](https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/).

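If you prefer to fetch a checkpoint ahead of time (e.g., for offline use), a minimal sketch with `huggingface_hub` is shown below; the local directory is an arbitrary example path.

```python
from huggingface_hub import snapshot_download

# Download either variant from the table above by its repo ID.
# local_dir is an example path; change it to wherever you keep model weights.
snapshot_download(
    repo_id="moonshotai/Kimi-VL-A3B-Instruct",
    local_dir="./kimi-vl-a3b-instruct",
)
```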
## Performance

Kimi-VL robustly handles diverse tasks (perception, math, college-level problems, OCR, agent interaction) across various input formats (image, multi-image, video, long-document). See the Tech Report for detailed benchmark results. A brief comparison with other models:

<div align="center">
<img width="100%" src="figures/instruct_perf.png">
</div>

## Example Usage (Transformers)

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model and processor (trust_remote_code is required for Kimi-VL).
model_path = "moonshotai/Kimi-VL-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build a chat-style message that pairs an image with a text question.
image_path = "./figures/demo.png"
image = Image.open(image_path)
messages = [
    {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
]

# Apply the chat template, preprocess image and text together, and generate.
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens from each sequence before decoding the response.
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
```

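The same message format extends to multiple images. Below is a minimal sketch that reuses the `model` and `processor` from the snippet above; the file paths are placeholders, and passing a list of images to the processor is an assumption that mirrors the single-image call.

```python
# Multi-image prompt: list several image entries before the text question.
image_paths = ["./figures/page1.png", "./figures/page2.png"]  # placeholder paths
images = [Image.open(p) for p in image_paths]
messages = [
    {"role": "user", "content": [
        *[{"type": "image", "image": p} for p in image_paths],
        {"type": "text", "text": "Compare these two pages and summarize the differences."},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```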
## Deployment (vLLM)

We have submitted pull request [#16387](https://github.com/vllm-project/vllm/pull/16387) to vLLM for easier deployment.

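Once that change lands in the vLLM version you have installed, serving should follow vLLM's standard OpenAI-compatible workflow. Below is a minimal sketch under that assumption; the port, image URL, and prompt are arbitrary examples.

```python
# Start the server first (shell):
#   vllm serve moonshotai/Kimi-VL-A3B-Instruct --trust-remote-code
# Then query it with any OpenAI-compatible client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Instruct",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.png"}},  # example URL
            {"type": "text", "text": "What is the dome building in the picture?"},
        ]}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```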
## Citation

```
@misc{kimiteam2025kimivltechnicalreport,
      title={{Kimi-VL} Technical Report},
      author={Kimi Team and ...},
      year={2025},
      eprint={2504.07491},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.07491},
}
```