zhouzaida committed on
Commit
704b5c8
·
1 Parent(s): e347507
.gitattributes CHANGED
@@ -33,4 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ arch.png filter=lfs diff=lfs merge=lfs -text
+ instruct_perf.png filter=lfs diff=lfs merge=lfs -text
+ thinking_perf.png filter=lfs diff=lfs merge=lfs -text
  figures/*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,152 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ base_model:
+ - moonshotai/Moonlight-16B-A3B
+ pipeline_tag: image-text-to-text
+ ---
+
+ <div align="center">
+ <img width="30%" src="figures/logo.png">
+ </div>
+
+ ## Introduction
+
+ We present **Kimi-VL**, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers **advanced multimodal reasoning, long-context understanding, and strong agent capabilities**, all while activating only **2.8B** parameters in its language decoder (Kimi-VL-A3B).
+
+ Kimi-VL demonstrates strong performance across challenging domains:
+ as a general-purpose VLM, Kimi-VL excels in multi-turn agent interaction tasks (e.g., OSWorld), achieving state-of-the-art results comparable to flagship models.
+ It also exhibits remarkable capabilities across diverse challenging vision-language tasks, including college-level image and video comprehension, optical character recognition (OCR), mathematical reasoning, and multi-image understanding.
+
+ In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several specialized domains.
+
+ Kimi-VL also advances the Pareto frontier of multimodal models in long-context processing and fine-grained perception. Equipped with a 128K extended context window, Kimi-VL can process long and diverse inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while keeping computational cost low for common visual inputs and general tasks.
+
+ Building on this foundation, we introduce an advanced long-thinking variant: **Kimi-VL-Thinking**. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining a compact 2.8B activated LLM parameter footprint, setting a new standard for efficient yet capable multimodal **thinking** models.
+
+ ## Architecture
+
+ The model combines an MoE language model, a native-resolution visual encoder (MoonViT), and an MLP projector, as illustrated in the following image.
+
+ <div align="center">
+ <img width="90%" src="figures/arch.png">
+ </div>
+
+ ## Model Variants
+
+ 🤗 For general multimodal perception and understanding, OCR, long videos and long documents, video perception, and agent use cases, we recommend `Kimi-VL-A3B-Instruct` for efficient inference; for advanced text and multimodal reasoning (e.g., math), please consider using `Kimi-VL-A3B-Thinking`.
+
+ <div align="center">
+
+ | **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download Link** |
+ | :------------: | :------------: | :------------: | :------------: | :------------: |
+ | Kimi-VL-A3B-Instruct | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Instruct) |
+ | Kimi-VL-A3B-Thinking | 16B | 3B | 128K | [🤗 Hugging Face](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking) |
+
+ </div>
+
+ ## Performance
+
+ As an efficient model, Kimi-VL robustly handles diverse tasks (fine-grained perception, math, college-level problems, OCR, agent tasks, etc.) across a broad spectrum of input forms (single-image, multi-image, video, long-document, etc.).
+
+ A brief comparison with existing 10B-level dense VLMs and DeepSeek-VL2 (A4.5B):
+
+ <div align="center">
+ <img width="100%" src="figures/instruct_perf.png">
+ </div>
+
+ Full comparison (GPT-4o included for reference):
+
+ <div align="center">
+
+ | Benchmark (Metric) | GPT-4o | GPT-4o-Mini | Qwen2.5-VL-7B | Llama3.2-11B-Inst. | Gemma3-12B-IT | DeepSeek-VL2 | Kimi-VL-A3B-Instruct |
+ |--------------------------------|--------|-------------|---------------|--------------------|---------------|--------------|-------------|
+ | **Architecture** | - | - | Dense | Dense | Dense | MoE | MoE |
+ | **# Act. Params (LLM+VT)** | - | - | 7.6B+0.7B | 8B+2.6B | 12B+0.4B | 4.1B+0.4B | 2.8B+0.4B |
+ | **# Total Params** | - | - | 8B | 11B | 12B | 28B | 16B |
+ | | | | | | | | |
+ | **College-level** | | | | | | | |
+ | MMMU-Val (Pass@1) | *69.1* | **60.0** | 58.6 | 48 | 59.6 | 51.1 | 57.0 |
+ | VideoMMMU (Pass@1) | *61.2* | - | 47.4 | 41.8 | **57.2** | 44.4 | 52.6 |
+ | MMVU-Val (Pass@1) | *67.4* | **61.6** | 50.1 | 44.4 | 57.0 | 52.1 | 52.2 |
+ | | | | | | | | |
+ | **General** | | | | | | | |
+ | MMBench-EN-v1.1 (Acc) | *83.1* | 77.1 | 82.6 | 65.8 | 74.6 | 79.6 | **83.1** |
+ | MMStar (Acc) | *64.7* | 54.8 | **63.9** | 49.8 | 56.1 | 55.5 | 61.3 |
+ | MMVet (Pass@1) | *69.1* | 66.9 | **67.1** | 57.6 | 64.9 | 60.0 | 66.7 |
+ | RealWorldQA (Acc) | *75.4* | 67.1 | **68.5** | 63.3 | 59.1 | 68.4 | 68.1 |
+ | AI2D (Acc) | *84.6* | 77.8 | 83.9 | 77.3 | 78.1 | 81.4 | **84.9** |
+ | | | | | | | | |
+ | **Multi-image** | | | | | | | |
+ | BLINK (Acc) | *68.0* | 53.6 | 56.4 | 39.8 | 50.3 | - | **57.3** |
+ | | | | | | | | |
+ | **Math** | | | | | | | |
+ | MathVista (Pass@1) | *63.8* | 52.5 | 68.2 | 47.7 | 56.1 | 62.8 | **68.7** |
+ | MathVision (Pass@1) | *30.4* | - | 25.1 | 13.6 | **32.1** | 17.3 | 21.4 |
+ | | | | | | | | |
+ | **OCR** | | | | | | | |
+ | InfoVQA (Acc) | *80.7* | 57.9 | 82.6 | 34.6 | 43.8 | 78.1 | **83.2** |
+ | OCRBench (Acc) | *815* | 785 | 864 | 753 | 702 | 811 | **867** |
+ | | | | | | | | |
+ | **OS Agent** | | | | | | | |
+ | ScreenSpot-V2 (Acc) | *18.1* | 6.9 | 84.2 | - | - | - | **92.8** |
+ | ScreenSpot-Pro (Acc) | *0.8* | - | 29.0 | - | - | - | **34.5** |
+ | OSWorld (Pass@1) | *5.03* | - | 2.5 | - | - | - | **8.22** |
+ | WindowsAgentArena (Pass@1) | *9.4* | 2.7 | 3.4 | - | - | - | **10.4** |
+ | | | | | | | | |
+ | **Long Document** | | | | | | | |
+ | MMLongBench-Doc (Acc) | *42.8* | 29.0 | 29.6 | 13.8 | 21.3 | - | **35.1** |
+ | | | | | | | | |
+ | **Long Video** | | | | | | | |
+ | Video-MME (w/o sub.) | *71.9* | 64.8 | 65.1 | 46.0 | 58.2 | - | **67.8** |
+ | Video-MME (w sub.) | *77.2* | 68.9 | 71.6 | 49.5 | 62.1 | - | **72.6** |
+ | MLVU-MCQ (Acc) | *64.6* | 48.1 | 70.2 | 44.4 | 52.3 | - | **74.2** |
+ | LongVideoBench (val) | *66.7* | 58.2 | 56.0 | 45.5 | 51.5 | - | **64.5** |
+ | | | | | | | | |
+ | **Video Perception** | | | | | | | |
+ | EgoSchema (full) | 72.2 | - | 65.0 | 54.3 | 56.9 | 38.5 | **78.5** |
+ | VSI-Bench | 34.0 | - | 34.2 | 20.6 | 32.4 | 21.7 | **37.4** |
+ | TOMATO | *37.7* | 28.8 | 27.6 | 21.5 | 28.6 | 27.2 | **31.7** |
+
+ </div>
+
+ ### Inference with 🤗 Hugging Face Transformers
+
+ This section shows how to run inference with our model using the `transformers` library. We recommend `python=3.10`, `torch>=2.1.0`, and `transformers=4.48.2` as the development environment.
+
+ ```python
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ model_path = "moonshotai/Kimi-VL-A3B-Instruct"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     torch_dtype="auto",
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ # Build a single-image, single-turn conversation
+ image_path = "./figures/demo.png"
+ image = Image.open(image_path)
+ messages = [
+     {"role": "user", "content": [{"type": "image", "image": image_path}, {"type": "text", "text": "What is the dome building in the picture? Think step by step."}]}
+ ]
+ # Render the chat template, then tokenize the text and preprocess the image together
+ text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
+ inputs = processor(images=image, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
+ # Strip the prompt tokens so that only the newly generated answer is decoded
+ generated_ids_trimmed = [
+     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ response = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )[0]
+ print(response)
+ ```
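
The processor also accepts text-only, multi-turn conversations (its `__call__`, shown later in this commit, handles `text` without `images`). Below is a minimal sketch under that assumption, reusing the `model` and `processor` objects created above.

```python
# Minimal text-only, multi-turn sketch; reuses `model` and `processor` from the snippet above.
messages = [
    {"role": "user", "content": "Who are you?"},
    {"role": "assistant", "content": "I am Kimi-VL, a vision-language assistant."},
    {"role": "user", "content": "Summarize your previous answer in five words."},
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Decode only the tokens generated after the prompt
response = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```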
+
+ ### Inference with vLLM
+
+ Coming soon!
chat_template.jinja ADDED
@@ -0,0 +1,31 @@
+ {%- for message in messages -%}
+ {%- if loop.first and messages[0]['role'] != 'system' -%}
+ {{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}
+ {%- endif -%}
+ {%- if message['role'] == 'system' -%}
+ {{'<|im_system|>'}}
+ {%- endif -%}
+ {%- if message['role'] == 'user' -%}
+ {{'<|im_user|>'}}
+ {%- endif -%}
+ {%- if message['role'] == 'assistant' -%}
+ {{'<|im_assistant|>'}}
+ {%- endif -%}
+ {{- message['role'] -}}
+ {{'<|im_middle|>'}}
+ {%- if message['content'] is string -%}
+ {{- message['content'] + '<|im_end|>' -}}
+ {%- else -%}
+ {%- for content in message['content'] -%}
+ {%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}
+ {{'<|media_start|>image<|media_content|><|media_pad|><|media_end|>'}}
+ {%- else -%}
+ {{content['text']}}
+ {%- endif -%}
+ {%- endfor -%}
+ {{'<|im_end|>'}}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- if add_generation_prompt -%}
+ {{'<|im_assistant|>assistant<|im_middle|>'}}
+ {%- endif -%}
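
For reference, a minimal sketch of what this template renders for a one-turn image-plus-text conversation (the exact prompt is produced by the processor at inference time; the multi-line layout in the comment is only for readability):

```python
from transformers import AutoProcessor

# Render the template above without tokenizing, to inspect the prompt format.
processor = AutoProcessor.from_pretrained("moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
# Expected output (a single line; wrapped here for readability):
#   <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>
#   <|im_user|>user<|im_middle|><|media_start|>image<|media_content|><|media_pad|><|media_end|>Describe the image.<|im_end|>
#   <|im_assistant|>assistant<|im_middle|>
```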
config.json ADDED
@@ -0,0 +1,74 @@
1
+ {
2
+ "architectures": [
3
+ "KimiVLForConditionalGeneration"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_kimi_vl.KimiVLConfig",
7
+ "AutoModel": "modeling_kimi_vl.KimiVLForConditionalGeneration",
8
+ "AutoModelForCausalLM": "modeling_kimi_vl.KimiVLForConditionalGeneration"
9
+ },
10
+ "vision_config": {
11
+ "model_type": "moonvit",
12
+ "patch_size": 14,
13
+ "num_attention_heads": 16,
14
+ "num_hidden_layers": 27,
15
+ "hidden_size": 1152,
16
+ "intermediate_size": 4304,
17
+ "init_pos_emb_height": 64,
18
+ "init_pos_emb_width": 64,
19
+ "merge_kernel_size": [
20
+ 2,
21
+ 2
22
+ ]
23
+ },
24
+ "text_config": {
25
+ "vocab_size": 163840,
26
+ "max_position_embeddings": 131072,
27
+ "hidden_size": 2048,
28
+ "intermediate_size": 11264,
29
+ "moe_intermediate_size": 1408,
30
+ "num_hidden_layers": 27,
31
+ "num_attention_heads": 16,
32
+ "n_shared_experts": 2,
33
+ "n_routed_experts": 64,
34
+ "ep_size": 1,
35
+ "routed_scaling_factor": 2.446,
36
+ "kv_lora_rank": 512,
37
+ "q_lora_rank": null,
38
+ "qk_rope_head_dim": 64,
39
+ "v_head_dim": 128,
40
+ "qk_nope_head_dim": 128,
41
+ "topk_method": "noaux_tc",
42
+ "n_group": 1,
43
+ "topk_group": 1,
44
+ "num_experts_per_tok": 6,
45
+ "moe_layer_freq": 1,
46
+ "first_k_dense_replace": 1,
47
+ "norm_topk_prob": true,
48
+ "scoring_func": "sigmoid",
49
+ "aux_loss_alpha": 0.001,
50
+ "seq_aux": true,
51
+ "num_key_value_heads": 16,
52
+ "hidden_act": "silu",
53
+ "initializer_range": 0.02,
54
+ "rms_norm_eps": 1e-05,
55
+ "pretraining_tp": 1,
56
+ "use_cache": true,
57
+ "rope_theta": 800000.0,
58
+ "rope_scaling": null,
59
+ "attention_bias": false,
60
+ "attention_dropout": 0.0,
61
+ "bos_token_id": 163584,
62
+ "pad_token_id": 163839,
63
+ "eos_token_id": 163585,
64
+ "torch_dtype": "bfloat16",
65
+ "tie_word_embeddings": false
66
+ },
67
+ "ignore_index": -100,
68
+ "media_placeholder_token_id": 163605,
69
+ "torch_dtype": "bfloat16",
70
+ "transformers_version": "4.50.3",
71
+ "tie_word_embeddings": false,
72
+ "vocab_size": 163840,
73
+ "model_type": "kimi_vl"
74
+ }
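
As a quick sanity check, the `auto_map` entries above let `AutoConfig` resolve this configuration through `configuration_kimi_vl.py`; a minimal sketch:

```python
from transformers import AutoConfig

# Load the custom config class via auto_map and inspect the nested sub-configs.
config = AutoConfig.from_pretrained("moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True)
print(config.model_type)                            # "kimi_vl"
print(config.vision_config.patch_size)              # 14 (MoonViT patch size)
print(config.text_config.num_experts_per_tok)       # 6 routed experts per token (plus 2 shared experts)
print(config.text_config.max_position_embeddings)   # 131072 -> the 128K context window
```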
configuration_kimi_vl.py ADDED
@@ -0,0 +1,272 @@
1
+ from transformers.configuration_utils import PretrainedConfig
2
+ from transformers.utils import logging
3
+ from typing import Optional, Union
4
+
5
+ logger = logging.get_logger(__name__)
6
+
7
+ DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
8
+
9
+ class DeepseekV3Config(PretrainedConfig):
10
+ r"""
11
+ This is the configuration class to store the configuration of a [`DeepseekV3Model`]. It is used to instantiate an DeepSeek
12
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
13
+ defaults will yield a similar configuration to that of the DeepSeek-V3.
14
+
15
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
16
+ documentation from [`PretrainedConfig`] for more information.
17
+
18
+ Copy from https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/configuration_deepseek.py
19
+
20
+ Args:
21
+ vocab_size (`int`, *optional*, defaults to 129280):
22
+ Vocabulary size of the DeepSeek model. Defines the number of different tokens that can be represented by the
23
+ `inputs_ids` passed when calling [`DeepseekV3Model`]
24
+ hidden_size (`int`, *optional*, defaults to 4096):
25
+ Dimension of the hidden representations.
26
+ intermediate_size (`int`, *optional*, defaults to 11008):
27
+ Dimension of the MLP representations.
28
+ moe_intermediate_size (`int`, *optional*, defaults to 1407):
29
+ Dimension of the MoE representations.
30
+ num_hidden_layers (`int`, *optional*, defaults to 32):
31
+ Number of hidden layers in the Transformer decoder.
32
+ num_nextn_predict_layers (`int`, *optional*, defaults to 1):
33
+ Number of nextn predict layers in the DeepSeekV3 Model.
34
+ num_attention_heads (`int`, *optional*, defaults to 32):
35
+ Number of attention heads for each attention layer in the Transformer decoder.
36
+ n_shared_experts (`int`, *optional*, defaults to None):
37
+ Number of shared experts, None means dense model.
38
+ n_routed_experts (`int`, *optional*, defaults to None):
39
+ Number of routed experts, None means dense model.
40
+ routed_scaling_factor (`float`, *optional*, defaults to 1.0):
41
+ Scaling factor or routed experts.
42
+ topk_method (`str`, *optional*, defaults to `greedy`):
43
+ Topk method used in routed gate.
44
+ n_group (`int`, *optional*, defaults to None):
45
+ Number of groups for routed experts.
46
+ topk_group (`int`, *optional*, defaults to None):
47
+ Number of selected groups for each token(for each token, ensuring the selected experts is only within `topk_group` groups).
48
+ num_experts_per_tok (`int`, *optional*, defaults to None):
49
+ Number of selected experts, None means dense model.
50
+ moe_layer_freq (`int`, *optional*, defaults to 1):
51
+ The frequency of the MoE layer: one expert layer for every `moe_layer_freq - 1` dense layers.
52
+ first_k_dense_replace (`int`, *optional*, defaults to 0):
53
+ Number of dense layers in shallow layers(embed->dense->dense->...->dense->moe->moe...->lm_head).
54
+ \--k dense layers--/
55
+ norm_topk_prob (`bool`, *optional*, defaults to False):
56
+ Whether to normalize the weights of the routed experts.
57
+ scoring_func (`str`, *optional*, defaults to 'softmax'):
58
+ Method of computing expert weights.
59
+ aux_loss_alpha (`float`, *optional*, defaults to 0.001):
60
+ Auxiliary loss weight coefficient.
61
+ seq_aux = (`bool`, *optional*, defaults to True):
62
+ Whether to compute the auxiliary loss for each individual sample.
63
+ num_key_value_heads (`int`, *optional*):
64
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
65
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
66
+ `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
67
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
68
+ by meanpooling all the original heads within that group. For more details checkout [this
69
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
70
+ `num_attention_heads`.
71
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
72
+ The non-linear activation function (function or string) in the decoder.
73
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
74
+ The maximum sequence length that this model might ever be used with.
75
+ initializer_range (`float`, *optional*, defaults to 0.02):
76
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
77
+ rms_norm_eps (`float`, *optional*, defaults to 1e-06):
78
+ The epsilon used by the rms normalization layers.
79
+ use_cache (`bool`, *optional*, defaults to `True`):
80
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
81
+ relevant if `config.is_decoder=True`.
82
+ pad_token_id (`int`, *optional*):
83
+ Padding token id.
84
+ bos_token_id (`int`, *optional*, defaults to 1):
85
+ Beginning of stream token id.
86
+ eos_token_id (`int`, *optional*, defaults to 2):
87
+ End of stream token id.
88
+ pretraining_tp (`int`, *optional*, defaults to 1):
89
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
90
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
91
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
92
+ issue](https://github.com/pytorch/pytorch/issues/76232).
93
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
94
+ Whether to tie weight embeddings
95
+ rope_theta (`float`, *optional*, defaults to 10000.0):
96
+ The base period of the RoPE embeddings.
97
+ rope_scaling (`Dict`, *optional*):
98
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
99
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
100
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
101
+ `max_position_embeddings` to the expected new maximum.
102
+ attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`):
103
+ Whether to use a bias in the query, key, value and output projection layers during self-attention.
104
+ attention_dropout (`float`, *optional*, defaults to 0.0):
105
+ The dropout ratio for the attention probabilities.
106
+
107
+ ```python
108
+ >>> from transformers import DeepseekV3Model, DeepseekV3Config
109
+
110
+ >>> # Initializing a Deepseek-V3 style configuration
111
+ >>> configuration = DeepseekV3Config()
112
+
113
+ >>> # Accessing the model configuration
114
+ >>> configuration = model.config
115
+ ```"""
116
+
117
+ model_type = "deepseek_v3"
118
+ keys_to_ignore_at_inference = ["past_key_values"]
119
+
120
+ def __init__(
121
+ self,
122
+ vocab_size=129280,
123
+ hidden_size=7168,
124
+ intermediate_size=18432,
125
+ moe_intermediate_size = 2048,
126
+ num_hidden_layers=61,
127
+ num_nextn_predict_layers=1,
128
+ num_attention_heads=128,
129
+ num_key_value_heads=128,
130
+ n_shared_experts = 1,
131
+ n_routed_experts = 256,
132
+ ep_size = 1,
133
+ routed_scaling_factor = 2.5,
134
+ kv_lora_rank = 512,
135
+ q_lora_rank = 1536,
136
+ qk_rope_head_dim = 64,
137
+ v_head_dim = 128,
138
+ qk_nope_head_dim = 128,
139
+ topk_method = 'noaux_tc',
140
+ n_group = 8,
141
+ topk_group = 4,
142
+ num_experts_per_tok = 8,
143
+ moe_layer_freq = 1,
144
+ first_k_dense_replace = 3,
145
+ norm_topk_prob = True,
146
+ scoring_func = 'sigmoid',
147
+ aux_loss_alpha = 0.001,
148
+ seq_aux = True,
149
+ hidden_act="silu",
150
+ max_position_embeddings=4096,
151
+ initializer_range=0.02,
152
+ rms_norm_eps=1e-6,
153
+ use_cache=True,
154
+ pad_token_id=None,
155
+ bos_token_id=0,
156
+ eos_token_id=1,
157
+ pretraining_tp=1,
158
+ tie_word_embeddings=False,
159
+ rope_theta=10000.0,
160
+ rope_scaling=None,
161
+ attention_bias=False,
162
+ attention_dropout=0.0,
163
+ **kwargs,
164
+ ):
165
+ self.vocab_size = vocab_size
166
+ self.max_position_embeddings = max_position_embeddings
167
+ self.hidden_size = hidden_size
168
+ self.intermediate_size = intermediate_size
169
+ self.moe_intermediate_size = moe_intermediate_size
170
+ self.num_hidden_layers = num_hidden_layers
171
+ self.num_nextn_predict_layers = num_nextn_predict_layers
172
+ self.num_attention_heads = num_attention_heads
173
+ self.n_shared_experts = n_shared_experts
174
+ self.n_routed_experts = n_routed_experts
175
+ self.ep_size = ep_size
176
+ self.routed_scaling_factor = routed_scaling_factor
177
+ self.kv_lora_rank = kv_lora_rank
178
+ self.q_lora_rank = q_lora_rank
179
+ self.qk_rope_head_dim = qk_rope_head_dim
180
+ self.v_head_dim = v_head_dim
181
+ self.qk_nope_head_dim = qk_nope_head_dim
182
+ self.topk_method = topk_method
183
+ self.n_group = n_group
184
+ self.topk_group = topk_group
185
+ self.num_experts_per_tok = num_experts_per_tok
186
+ self.moe_layer_freq = moe_layer_freq
187
+ self.first_k_dense_replace = first_k_dense_replace
188
+ self.norm_topk_prob = norm_topk_prob
189
+ self.scoring_func = scoring_func
190
+ self.aux_loss_alpha = aux_loss_alpha
191
+ self.seq_aux = seq_aux
192
+ # for backward compatibility
193
+ if num_key_value_heads is None:
194
+ num_key_value_heads = num_attention_heads
195
+
196
+ self.num_key_value_heads = num_key_value_heads
197
+ self.hidden_act = hidden_act
198
+ self.initializer_range = initializer_range
199
+ self.rms_norm_eps = rms_norm_eps
200
+ self.pretraining_tp = pretraining_tp
201
+ self.use_cache = use_cache
202
+ self.rope_theta = rope_theta
203
+ self.rope_scaling = rope_scaling
204
+ self.attention_bias = attention_bias
205
+ self.attention_dropout = attention_dropout
206
+
207
+ super().__init__(
208
+ pad_token_id=pad_token_id,
209
+ bos_token_id=bos_token_id,
210
+ eos_token_id=eos_token_id,
211
+ tie_word_embeddings=tie_word_embeddings,
212
+ **kwargs,
213
+ )
214
+
215
+
216
+ class MoonViTConfig(PretrainedConfig):
217
+ model_type = "moonvit"
218
+
219
+ def __init__(
220
+ self,
221
+ patch_size: int = 14,
222
+ init_pos_emb_height: int = 64,
223
+ init_pos_emb_width: int = 64,
224
+ num_attention_heads: int = 16,
225
+ num_hidden_layers: int = 27,
226
+ hidden_size: int = 1152,
227
+ intermediate_size: int = 4304,
228
+ merge_kernel_size: tuple[int, int] = (2, 2),
229
+ **kwargs,
230
+ ):
231
+ super().__init__(**kwargs)
232
+ self.patch_size = patch_size
233
+ # Positional embedding config
234
+ self.init_pos_emb_height = init_pos_emb_height
235
+ self.init_pos_emb_width = init_pos_emb_width
236
+ # Transformer config
237
+ self.num_hidden_layers = num_hidden_layers
238
+ self.num_attention_heads = num_attention_heads
239
+ self.hidden_size = hidden_size
240
+ self.intermediate_size = intermediate_size
241
+ # Patch merger config
242
+ self.merge_kernel_size = merge_kernel_size
243
+
244
+
245
+ class KimiVLConfig(PretrainedConfig):
246
+ model_type = "kimi_vl"
247
+
248
+ def __init__(
249
+ self,
250
+ vision_config: Optional[Union[dict, MoonViTConfig]] = None,
251
+ text_config: Optional[Union[dict, DeepseekV3Config]] = None,
252
+ ignore_index: int = -100,
253
+ media_placeholder_token_id: int = 163605,
254
+ pad_token_id: int = 0,
255
+ **kwargs
256
+ ):
257
+ if vision_config is None:
258
+ vision_config = MoonViTConfig()
259
+ elif isinstance(vision_config, dict):
260
+ vision_config = MoonViTConfig(**vision_config)
261
+ self.vision_config = vision_config
262
+
263
+ if text_config is None:
264
+ text_config = DeepseekV3Config()
265
+ elif isinstance(text_config, dict):
266
+ text_config = DeepseekV3Config(**text_config)
267
+ self.text_config = text_config
268
+
269
+ self.ignore_index = ignore_index
270
+ self.media_placeholder_token_id = media_placeholder_token_id
271
+
272
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
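
A minimal usage sketch (assuming this module is importable from the working directory): `KimiVLConfig` promotes plain dicts to `MoonViTConfig` / `DeepseekV3Config`, which is how the nested `vision_config` / `text_config` blocks in `config.json` are materialized.

```python
from configuration_kimi_vl import KimiVLConfig, MoonViTConfig, DeepseekV3Config

# Dicts are converted to the typed sub-configs; omitted fields fall back to class defaults.
config = KimiVLConfig(
    vision_config={"patch_size": 14, "hidden_size": 1152, "num_hidden_layers": 27},
    text_config={"hidden_size": 2048, "num_hidden_layers": 27, "num_experts_per_tok": 6},
)
assert isinstance(config.vision_config, MoonViTConfig)
assert isinstance(config.text_config, DeepseekV3Config)
print(config.media_placeholder_token_id)  # 163605, the <|media_pad|> token id
```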
figures/arch.png ADDED

Git LFS Details

  • SHA256: 5195d9f99c08f7e135eedb19cf370d92c36b7b3387e9c1b7cad5e24990a0d6d0
  • Pointer size: 131 Bytes
  • Size of remote file: 641 kB
figures/demo.png ADDED

Git LFS Details

  • SHA256: 95de8765da89c41a2421f1c1fa3986e4d3c83793d92c8ade460a142b329d04c1
  • Pointer size: 131 Bytes
  • Size of remote file: 525 kB
figures/instruct_perf.png ADDED

Git LFS Details

  • SHA256: 52405bbe3e3b0c30a5502c40095241000d4c3dbf9862d7a6c0b6079806292ad4
  • Pointer size: 132 Bytes
  • Size of remote file: 2.24 MB
figures/logo.png ADDED

Git LFS Details

  • SHA256: 7870b48105beb49cdb29bb3090abb7bbca688bef862507904c23d9c472df221c
  • Pointer size: 130 Bytes
  • Size of remote file: 13.1 kB
image_processing_kimi_vl.py ADDED
@@ -0,0 +1,126 @@
1
+ """Image processor class for KimiVL."""
2
+
3
+ import math
4
+ import numpy as np
5
+ from PIL import Image
6
+ from typing import Optional, Union
7
+
8
+ import torch
9
+ from torchvision.transforms import functional as TF
10
+ from transformers.image_utils import ImageInput, make_list_of_images, valid_images
11
+ from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
12
+ from transformers.utils import TensorType
13
+
14
+
15
+ OPENAI_DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)
16
+ OPENAI_DATASET_STD = (0.26862954, 0.26130258, 0.27577711)
17
+
18
+
19
+ class KimiVLImageProcessor(BaseImageProcessor):
20
+ model_type = "kimi_vl"
21
+
22
+ def __init__(
23
+ self,
24
+ patch_size: int = 14,
25
+ pad_input: bool = False,
26
+ image_mean: tuple[float, float, float] = OPENAI_DATASET_MEAN,
27
+ image_std: tuple[float, float, float] = OPENAI_DATASET_STD,
28
+ in_token_limit: int = 4096,
29
+ merge_kernel_size: list[int, int] = [2, 2],
30
+ **kwargs,
31
+ ):
32
+ super().__init__(**kwargs)
33
+ self.in_token_limit = in_token_limit
34
+ self.patch_size = patch_size
35
+ self.pad_input = pad_input
36
+ self.image_mean = image_mean
37
+ self.image_std = image_std
38
+ self.merge_kernel_size = merge_kernel_size
39
+
40
+ def rescale(
41
+ self, image: Image.Image, merge_kernel_size: list[int, int] = [2, 2]
42
+ ) -> Image.Image:
43
+ w, h = image.size
44
+ patch_size = self.patch_size
45
+
46
+ if (w // patch_size) * (h // patch_size) > self.in_token_limit:
47
+ scale = math.sqrt(self.in_token_limit / ((w // patch_size) * (h // patch_size)))
48
+ new_w, new_h = int(w * scale), int(h * scale)
49
+ image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
50
+ if self.pad_input:
51
+ new_w, new_h = image.size
52
+ pad_size_h = merge_kernel_size[0] * patch_size
53
+ pad_size_w = merge_kernel_size[1] * patch_size
54
+
55
+ pad_h = (pad_size_h - new_h % pad_size_h) % pad_size_h
56
+ pad_w = (pad_size_w - new_w % pad_size_w) % pad_size_w
57
+
58
+ image = TF.pad(image, (0, 0, pad_w, pad_h))
59
+ else:
60
+ new_w, new_h = image.size
61
+ new_w = new_w - new_w % patch_size
62
+ new_h = new_h - new_h % patch_size
63
+ image = TF.center_crop(image, (new_h, new_w))
64
+
65
+ w, h = image.size
66
+ if w // patch_size >= 512 or h // patch_size >= 512:
67
+ raise ValueError("Image exceeds the positional embedding capacity (>= 512 patches per side)")
68
+
69
+ return image
70
+
71
+ def to_tensor(self, image: Image.Image) -> torch.Tensor:
72
+ return TF.to_tensor(image.convert("RGB"))
73
+
74
+ def normalize(self, image: torch.Tensor) -> torch.Tensor:
75
+ return TF.normalize(image, self.image_mean, self.image_std)
76
+
77
+ def patchify(self, image: torch.Tensor) -> tuple[torch.Tensor, list[int, int]]:
78
+ patch_size = self.patch_size
79
+ C, H, W = image.shape
80
+ patches = image.reshape(C, H // patch_size, patch_size, W // patch_size, patch_size)
81
+ patches = patches.permute(1, 3, 0, 2, 4)
82
+ patches = patches.contiguous().view(-1, C, patch_size, patch_size)
83
+ grid_hw = (H // patch_size, W // patch_size)
84
+ return patches, grid_hw
85
+
86
+ def _preprocess(self, image: ImageInput) -> tuple[torch.Tensor, list[int, int]]:
87
+ """
88
+ Preprocess image and patchify it.
89
+
90
+ Args:
91
+ image (`ImageInput`):
92
+ Image to preprocess. Expects pixel values ranging from 0 to 255. If pixel values range from 0 to 1, set `do_rescale=False`.
93
+
94
+ Returns:
95
+ patches: torch.Tensor
96
+ grid_hw: list[int, int]
97
+ """
98
+ image = self.rescale(image, self.merge_kernel_size)
99
+ image = self.to_tensor(image)
100
+ image = self.normalize(image)
101
+ patches, grid_hw = self.patchify(image)
102
+ return patches, grid_hw
103
+
104
+ def preprocess(
105
+ self,
106
+ images: ImageInput,
107
+ return_tensors: Optional[Union[str, TensorType]] = None,
108
+ ) -> BatchFeature:
109
+ images = make_list_of_images(images)
110
+
111
+ if not valid_images(images):
112
+ raise ValueError(
113
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
114
+ "torch.Tensor, tf.Tensor or jax.ndarray."
115
+ )
116
+
117
+ pixel_values, image_grid_hws = [], []
118
+ for image in images:
119
+ patches, image_grid_hw = self._preprocess(image)
120
+ pixel_values.append(patches)
121
+ image_grid_hws.append(image_grid_hw)
122
+ pixel_values = torch.concat(pixel_values, dim=0)
123
+ image_grid_hws = np.array(image_grid_hws)
124
+ data = {"pixel_values": pixel_values, "image_grid_hws": image_grid_hws}
125
+
126
+ return BatchFeature(data=data, tensor_type=return_tensors)
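
A minimal sketch of what this image processor produces (importing the module locally; the toy image size is arbitrary): `pixel_values` is a flat stack of 14x14 patches and `image_grid_hws` records each image's patch grid, which the processor later uses to count image tokens.

```python
from PIL import Image
from image_processing_kimi_vl import KimiVLImageProcessor

processor = KimiVLImageProcessor(patch_size=14, pad_input=True)
image = Image.new("RGB", (448, 336))   # width=448, height=336 -> a 24x32 patch grid
out = processor.preprocess(image, return_tensors="pt")
print(out["pixel_values"].shape)       # (768, 3, 14, 14): 24 * 32 patches of 14x14 pixels
print(out["image_grid_hws"])           # [[24, 32]] -> (H // 14, W // 14)
```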
model-00001-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a5ef3ebd9727f82e34417a778317d2cc9c08762fe0bc4a2ee333b8a52cf7c1a5
3
+ size 4994390288
model-00002-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:45ecd00decdad65e7d3f494028ed0c79d0cd56f145adae077ffa78b5b8ff95c0
3
+ size 4995061424
model-00003-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b9ba98f01e22eea43da8cfbf6f09ff0857616cd9e1df4603a735c111755109ef
3
+ size 4996100112
model-00004-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8f7eb3fc5c12481fd1a81d2708fa4a299ff4c207a5dbafb5b8bef25ab9fd8b23
3
+ size 4996100320
model-00005-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:db83e896b3d75f4bc51f621ae90ad61e8f6f7901f49a06b5c40348b839057206
3
+ size 4998185720
model-00006-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:abed41982a3f9c7d05f69bb560dad7bcb93cfa764c77a2b59127f35dc787983c
3
+ size 4996099448
model-00007-of-00007.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2e2ddd33f2b4f472898482860585bd6d73d4397c8c833ed9d00a2024443f7a77
3
+ size 2840161216
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_kimi_vl.py ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "auto_map": {
+     "AutoImageProcessor": "image_processing_kimi_vl.KimiVLImageProcessor",
+     "AutoProcessor": "processing_kimi_vl.KimiVLProcessor"
+   },
+   "in_token_limit": 4096,
+   "patch_size": 14,
+   "num_pooled_tokens": 1024,
+   "image_mean": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "image_std": [
+     0.5,
+     0.5,
+     0.5
+   ],
+   "pad_input": true
+ }
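
A minimal sketch: with `trust_remote_code=True`, the `auto_map` above routes `AutoImageProcessor` (and `AutoProcessor`) to the custom classes shipped in this repository, initialized with the values in this file.

```python
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True
)
print(type(image_processor).__name__)                               # expected: KimiVLImageProcessor
print(image_processor.patch_size, image_processor.in_token_limit)   # 14 4096
```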
processing_kimi_vl.py ADDED
@@ -0,0 +1,170 @@
1
+ # coding=utf-8
2
+ # Copyright 2025 The Moonshot Team and HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # The code is based on the Qwen2VL processor (qwen2_vl/processing_qwen2_vl.py), but modified for KimiVL.
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License");
7
+ # you may not use this file except in compliance with the License.
8
+ # You may obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15
+ # See the License for the specific language governing permissions and
16
+ # limitations under the License.
17
+ """
18
+ Processor class for KimiVL.
19
+ """
20
+
21
+ from typing import List, Union
22
+
23
+ from transformers.feature_extraction_utils import BatchFeature
24
+ from transformers.image_utils import ImageInput
25
+ from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, _validate_images_text_input_order
26
+ from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
27
+ from transformers.utils import logging
28
+
29
+
30
+ logger = logging.get_logger(__name__)
31
+
32
+
33
+ class KimiVLProcessorKwargs(ProcessingKwargs, total=False):
34
+ _defaults = {
35
+ "text_kwargs": {
36
+ "padding": False,
37
+ },
38
+ "images_kwargs": {},
39
+ }
40
+
41
+
42
+ class KimiVLProcessor(ProcessorMixin):
43
+ r"""
44
+ Constructs a KimiVL processor which wraps a KimiVL image processor and a tokenizer into a single processor.
45
+
46
+ [`KimiVLProcessor`] offers all the functionalities of [`KimiVLImageProcessor`] and [`TikTokenTokenizer`]. See the
47
+ [`~KimiVLProcessor.__call__`] and [`~KimiVLProcessor.decode`] for more information.
48
+
49
+ Args:
50
+ image_processor ([`KimiVLImageProcessor`], *optional*):
51
+ The image processor is a required input.
52
+ tokenizer ([`TikTokenTokenizer`], *optional*):
53
+ The tokenizer is a required input.
54
+ chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
55
+ in a chat into a tokenizable string.
56
+ """
57
+
58
+ attributes = ["image_processor", "tokenizer"]
59
+ valid_kwargs = [ "chat_template"]
60
+ image_processor_class = "AutoImageProcessor"
61
+ tokenizer_class = "AutoTokenizer"
62
+
63
+ def __init__(
64
+ self,
65
+ image_processor=None,
66
+ tokenizer=None,
67
+ chat_template=None,
68
+ **kwargs,
69
+ ):
70
+ self.image_token = "<|media_pad|>"
71
+ super().__init__(image_processor, tokenizer, chat_template=chat_template)
72
+
73
+ def __call__(
74
+ self,
75
+ images: ImageInput = None,
76
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
77
+ **kwargs: Unpack[KimiVLProcessorKwargs],
78
+ ) -> BatchFeature:
79
+ """
80
+ Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
81
+ and `kwargs` arguments to TikTokenTokenizer's [`~TikTokenTokenizer.__call__`] if `text` is not `None` to encode
82
+ the text. To prepare the image(s), this method forwards the `images` and `kwrags` arguments to
83
+ CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
84
+ of the above two methods for more information.
85
+
86
+ Args:
87
+ images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
88
+ The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
89
+ tensor. Both channels-first and channels-last formats are supported.
90
+ text (`str`, `List[str]`, `List[List[str]]`):
91
+ The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
92
+ (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
93
+ `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
94
+ return_tensors (`str` or [`~utils.TensorType`], *optional*):
95
+ If set, will return tensors of a particular framework. Acceptable values are:
96
+ - `'tf'`: Return TensorFlow `tf.constant` objects.
97
+ - `'pt'`: Return PyTorch `torch.Tensor` objects.
98
+ - `'np'`: Return NumPy `np.ndarray` objects.
99
+ - `'jax'`: Return JAX `jnp.ndarray` objects.
100
+
101
+ Returns:
102
+ [`BatchFeature`]: A [`BatchFeature`] with the following fields:
103
+
104
+ - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
105
+ - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
106
+ `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
107
+ `None`).
108
+ - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
109
+ """
110
+ if images is None and text is None:
111
+ raise ValueError("You have to specify at least one of `images` or `text`.")
112
+
113
+ # check if images and text inputs are reversed for BC
114
+ images, text = _validate_images_text_input_order(images, text)
115
+
116
+ output_kwargs = self._merge_kwargs(
117
+ KimiVLProcessorKwargs,
118
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
119
+ **kwargs,
120
+ )
121
+ if images is not None:
122
+ image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
123
+ image_grid_hws = image_inputs["image_grid_hws"]
124
+ else:
125
+ image_inputs = {}
126
+ image_grid_hws = None
127
+
128
+ if isinstance(text, str):
129
+ text = [text]
130
+ elif not isinstance(text, list) and not isinstance(text[0], str):
131
+ raise ValueError("Invalid input text. Please provide a string, or a list of strings")
132
+
133
+ if image_grid_hws is not None:
134
+ merge_length = self.image_processor.merge_kernel_size[0] * self.image_processor.merge_kernel_size[1]
135
+ index = 0
136
+ for i in range(len(text)):
137
+ while self.image_token in text[i]:
138
+ text[i] = text[i].replace(
139
+ self.image_token,
140
+ "<|placeholder|>" * (image_grid_hws[index].prod() // merge_length),
141
+ 1,
142
+ )
143
+ index += 1
144
+ text[i] = text[i].replace("<|placeholder|>", self.image_token)
145
+
146
+ text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
147
+ return BatchFeature(data={**text_inputs, **image_inputs})
148
+
149
+ def batch_decode(self, *args, **kwargs):
150
+ """
151
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
152
+ refer to the docstring of this method for more information.
153
+ """
154
+ return self.tokenizer.batch_decode(*args, **kwargs)
155
+
156
+ def decode(self, *args, **kwargs):
157
+ """
158
+ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
159
+ the docstring of this method for more information.
160
+ """
161
+ return self.tokenizer.decode(*args, **kwargs)
162
+
163
+ @property
164
+ def model_input_names(self):
165
+ tokenizer_input_names = self.tokenizer.model_input_names
166
+ image_processor_input_names = self.image_processor.model_input_names
167
+ return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
168
+
169
+
170
+ __all__ = ["KimiVLProcessorKwargs"]
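
A minimal sketch of the placeholder expansion performed in `__call__` above: every `<|media_pad|>` occurrence in the text is expanded to one token per merged patch, i.e. `grid_h * grid_w // (merge_kernel_size[0] * merge_kernel_size[1])` copies.

```python
import numpy as np

# Example with the grid produced for a 336x448 image (24x32 patches) and a 2x2 merge kernel.
image_grid_hws = np.array([[24, 32]])   # from KimiVLImageProcessor: (H // 14, W // 14)
merge_length = 2 * 2                    # merge_kernel_size[0] * merge_kernel_size[1]
num_image_tokens = image_grid_hws[0].prod() // merge_length
print(num_image_tokens)                 # 192 <|media_pad|> tokens are inserted for this image
```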
tiktoken.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b6c497a7469b33ced9c38afb1ad6e47f03f5e5dc05f15930799210ec050c5103
3
+ size 2795286
tokenization_moonshot.py ADDED
@@ -0,0 +1,309 @@
1
+ import os
2
+ import tiktoken
3
+
4
+ from logging import getLogger
5
+ from pathlib import Path
6
+ from typing import (
7
+ cast,
8
+ Tuple,
9
+ Dict,
10
+ Iterator,
11
+ List,
12
+ Union,
13
+ Optional,
14
+ )
15
+ from shutil import copyfile
16
+ from tiktoken.load import load_tiktoken_bpe
17
+ from tokenizers import AddedToken
18
+ from transformers.tokenization_utils import PreTrainedTokenizer
19
+ from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
20
+
21
+
22
+ logger = getLogger(__name__)
23
+ VOCAB_FILES_NAMES = {"vocab_file": "tiktoken.model"}
24
+ SPIECE_UNDERLINE = "▁"
25
+
26
+
27
+ class TikTokenTokenizer(PreTrainedTokenizer):
28
+ """
29
+ Tokenizing and encoding/decoding text using the Tiktoken tokenizer. See megatron/tokenizer/tiktoken_tokenizer.py.
30
+
31
+ This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
32
+ this superclass for more information regarding those methods.
33
+
34
+ Args:
35
+ vocab_file (`str`):
36
+ The path to the Tiktoken model file.
37
+ bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<|begin_of_text|>",`):
38
+ The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
39
+ eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<|end_of_text|>"`):
40
+ The end of sequence token.
41
+ unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<|reserved_special_token_249|>"`):
42
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
43
+ token instead. The second to last item in special_tokens.
44
+ pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"<|reserved_special_token_250|>"`):
45
+ The token used for padding, for example when batching sequences of different lengths.
46
+ additional_special_tokens (list of `str`, *optional*):
47
+ A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
48
+ skipped when decoding if `skip_special_tokens` is set to `True`.
49
+ """
50
+
51
+ vocab_files_names = VOCAB_FILES_NAMES
52
+
53
+ model_input_names = ["input_ids", "attention_mask"]
54
+
55
+ special_tokens: Dict[str, int]
56
+
57
+ num_reserved_special_tokens = 256
58
+
59
+ pat_str = "|".join(
60
+ [
61
+ r"""[\p{Han}]+""",
62
+ r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
63
+ r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
64
+ r"""\p{N}{1,3}""",
65
+ r""" ?[^\s\p{L}\p{N}]+[\r\n]*""",
66
+ r"""\s*[\r\n]+""",
67
+ r"""\s+(?!\S)""",
68
+ r"""\s+""",
69
+ ]
70
+ )
71
+
72
+ def __init__(
73
+ self,
74
+ vocab_file,
75
+ bos_token: Union[str, AddedToken] = "[BOS]",
76
+ eos_token: Union[str, AddedToken] = "[EOS]",
77
+ unk_token: Union[str, AddedToken] = "[UNK]",
78
+ pad_token: Union[str, AddedToken] = "[PAD]",
79
+ additional_special_tokens: Optional[List[str]] = None,
80
+ added_tokens_decoder: Optional[dict] = None,
81
+ **kwargs,
82
+ ):
83
+ assert os.path.isfile(vocab_file), vocab_file
84
+ if additional_special_tokens is None:
85
+ additional_special_tokens = [
86
+ "<|im_end|>",
87
+ "<|im_middle|>",
88
+ "<|im_user|>",
89
+ "<|im_assistant|>",
90
+ "<|im_system|>",
91
+ ]
92
+ special_tokens_mapping = {
93
+ i: added_tokens_decoder[i].content for i in added_tokens_decoder
94
+ }
95
+
96
+ special_tokens = (
97
+ [str(bos_token), str(eos_token)]
98
+ + additional_special_tokens
99
+ + [str(unk_token), str(pad_token)]
100
+ )
101
+
102
+ self.vocab_file = vocab_file
103
+ mergeable_ranks = load_tiktoken_bpe(vocab_file)
104
+ num_base_tokens = len(mergeable_ranks)
105
+ self.special_tokens = {
106
+ special_tokens_mapping.get(i, f"<|reserved_token_{i}|>"): i
107
+ for i in range(
108
+ num_base_tokens, num_base_tokens + self.num_reserved_special_tokens + 2
109
+ )
110
+ }
111
+
112
+ self.model = tiktoken.Encoding(
113
+ name=Path(vocab_file).name,
114
+ pat_str=self.pat_str,
115
+ mergeable_ranks=mergeable_ranks,
116
+ special_tokens=self.special_tokens,
117
+ )
118
+ logger.info(f"Reloaded tiktoken model from {vocab_file}")
119
+
120
+ self.n_words: int = self.model.n_vocab
121
+ # BOS / EOS token IDs
122
+ self.bos_id: int = self.special_tokens[str(bos_token)]
123
+ self.eos_id: int = self.special_tokens[str(eos_token)]
124
+ logger.info(
125
+ f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
126
+ )
127
+
128
+ self.pad_id: int = self.special_tokens[str(pad_token)]
129
+ self.unk_id: int = self.special_tokens[str(unk_token)]
130
+
131
+ self.byte_encoder = bytes_to_unicode()
132
+ self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
133
+
134
+ self.decoder = {}
135
+ for i in range(self.n_words):
136
+ # Taken from https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
137
+ decoding = "".join(
138
+ [
139
+ self.byte_encoder[ord(char)]
140
+ for char in self.model.decode_single_token_bytes(i).decode(
141
+ "latin-1"
142
+ )
143
+ ]
144
+ )
145
+ self.decoder[i] = decoding
146
+
147
+ self.encoder = {}
148
+ for i in range(self.n_words):
149
+ if i in self.decoder:
150
+ self.encoder[self.decoder[i]] = i
151
+
152
+ super().__init__(
153
+ bos_token=bos_token,
154
+ eos_token=eos_token,
155
+ unk_token=unk_token,
156
+ pad_token=pad_token,
157
+ additional_special_tokens=additional_special_tokens,
158
+ **kwargs,
159
+ )
160
+ self.all_special_ids_set = set(self.all_special_ids)
161
+
162
+ def encode(
163
+ self, text: str, allow_special_tokens: bool = True, **kwargs
164
+ ) -> List[int]:
165
+ """
166
+ Encodes a string into a list of token IDs.
167
+
168
+ Args:
169
+ text (str): The input string to be encoded.
170
+
171
+ Returns:
172
+ list[int]: A list of token IDs.
173
+ """
174
+ # If there are other args, we should call super().encode because there are a lot of code
175
+ # to handle those args. supper().encode finally will call _tokenize and _convert_token_to_id.
176
+ if len(kwargs) > 0:
177
+ return super().encode(text, **kwargs)
178
+
179
+ assert type(text) is str
180
+
181
+ # The tiktoken tokenizer can handle <=400k chars without
182
+ # pyo3_runtime.PanicException.
183
+ TIKTOKEN_MAX_ENCODE_CHARS = 400_000
184
+
185
+ # https://github.com/openai/tiktoken/issues/195
186
+ # Here we iterate over subsequences and split if we exceed the limit
187
+ # of max consecutive non-whitespace or whitespace characters.
188
+ MAX_NO_WHITESPACES_CHARS = 25_000
189
+
190
+ substrs = (
191
+ substr
192
+ for i in range(0, len(text), TIKTOKEN_MAX_ENCODE_CHARS)
193
+ for substr in self._split_whitespaces_or_nonwhitespaces(
194
+ text[i : i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
195
+ )
196
+ )
197
+ t: List[int] = []
198
+ for substr in substrs:
199
+ if allow_special_tokens:
200
+ t.extend(
201
+ # we should consider special token as a common token
202
+ self.model.encode(
203
+ substr,
204
+ allowed_special="all",
205
+ )
206
+ )
207
+ else:
208
+ t.extend(
209
+ # we should consider special token as a common token
210
+ self.model.encode(
211
+ substr,
212
+ disallowed_special=(),
213
+ )
214
+ )
215
+ return t
216
+
217
+ def decode(self, token_ids: Union[int, List[int]], **kwargs) -> str:
218
+ """
219
+ Decodes a list of token IDs into a string.
220
+
221
+ Args:
222
+ t (List[int]): The list of token IDs to be decoded.
223
+
224
+ Returns:
225
+ str: The decoded string.
226
+ """
227
+ # If there are other args, we should call super().decode because there are a lot of code
228
+ # to handle those args. supper().encode finally will call convert_tokens_to_string and _convert_id_to_token.
229
+ if len(kwargs) > 0:
230
+ return super().decode(token_ids, **kwargs)
231
+
232
+ if type(token_ids) is int:
233
+ token_ids = [token_ids]
234
+
235
+ return self.model.decode(cast(List[int], token_ids))
236
+
237
+ @staticmethod
238
+ def _split_whitespaces_or_nonwhitespaces(
239
+ s: str, max_consecutive_slice_len: int
240
+ ) -> Iterator[str]:
241
+ """
242
+ Splits the string `s` so that each substring contains no more than `max_consecutive_slice_len`
243
+ consecutive whitespaces or consecutive non-whitespaces.
244
+ """
245
+ current_slice_len = 0
246
+ current_slice_is_space = s[0].isspace() if len(s) > 0 else False
247
+ slice_start = 0
248
+
249
+ for i in range(len(s)):
250
+ is_now_space = s[i].isspace()
251
+
252
+ if current_slice_is_space ^ is_now_space:
253
+ current_slice_len = 1
254
+ current_slice_is_space = is_now_space
255
+ else:
256
+ current_slice_len += 1
257
+ if current_slice_len > max_consecutive_slice_len:
258
+ yield s[slice_start:i]
259
+ slice_start = i
260
+ current_slice_len = 1
261
+ yield s[slice_start:]
262
+
263
+ """ ----- Below are the abstract methods required by PreTrainedTokenizer ----- """
264
+
265
+ @property
266
+ def vocab_size(self) -> int:
267
+ return self.n_words
268
+
269
+ def get_vocab(self) -> Dict[str, int]:
270
+ return self.encoder
271
+
272
+ def _tokenize(self, text: str, **kwargs) -> List[str]:
273
+ return [self.decoder[t] for t in self.encode(text)]
274
+
275
+ def _convert_token_to_id(self, token: str) -> int:
276
+ return self.encoder.get(token, self.unk_id)
277
+
278
+ def _convert_id_to_token(self, index: int) -> str:
279
+ return self.decoder.get(index)
280
+
281
+ @staticmethod
282
+ def clean_up_tokenization(out_string: str) -> str:
283
+ return out_string
284
+
285
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
286
+ text = "".join(tokens).replace(SPIECE_UNDERLINE, "")
287
+ text = bytearray([self.byte_decoder[c] for c in text]).decode(
288
+ "utf-8", "replace"
289
+ )
290
+ return text
291
+
292
+ def save_vocabulary(
293
+ self, save_directory: str, filename_prefix: Optional[str] = None
294
+ ) -> Tuple[str]:
295
+ if not os.path.isdir(save_directory):
296
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
297
+ return
298
+ out_vocab_file = os.path.join(
299
+ save_directory,
300
+ (filename_prefix + "-" if filename_prefix else "")
301
+ + VOCAB_FILES_NAMES["vocab_file"],
302
+ )
303
+
304
+ if os.path.abspath(self.vocab_file) != os.path.abspath(
305
+ out_vocab_file
306
+ ) and os.path.isfile(self.vocab_file):
307
+ copyfile(self.vocab_file, out_vocab_file)
308
+
309
+ return (out_vocab_file,)
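
A minimal usage sketch (the tokenizer is wired up through `tokenizer_config.json`, which maps `AutoTokenizer` to `TikTokenTokenizer` and points `vocab_file` at `tiktoken.model`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-VL-A3B-Instruct", trust_remote_code=True)
ids = tokenizer.encode("Hello, Kimi-VL!")
print(ids)                                 # token IDs (exact values depend on the shipped BPE ranks)
print(tokenizer.decode(ids))               # "Hello, Kimi-VL!"
print(tokenizer.encode("<|media_pad|>"))   # special tokens map to single IDs (163605 for <|media_pad|>)
```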
tokenizer_config.json ADDED
@@ -0,0 +1,134 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "163584": {
4
+ "content": "[BOS]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "163585": {
12
+ "content": "[EOS]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "163586": {
20
+ "content": "<|im_end|>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "163601": {
28
+ "content": "<|im_middle|>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "163587": {
36
+ "content": "<|im_user|>",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ },
43
+ "163588": {
44
+ "content": "<|im_assistant|>",
45
+ "lstrip": false,
46
+ "normalized": false,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": true
50
+ },
51
+ "163594": {
52
+ "content": "<|im_system|>",
53
+ "lstrip": false,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": true
58
+ },
59
+ "163602": {
60
+ "content": "<|media_start|>",
61
+ "lstrip": false,
62
+ "normalized": false,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": true
66
+ },
67
+ "163603": {
68
+ "content": "<|media_content|>",
69
+ "lstrip": false,
70
+ "normalized": false,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": true
74
+ },
75
+ "163604": {
76
+ "content": "<|media_end|>",
77
+ "lstrip": false,
78
+ "normalized": false,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": true
82
+ },
83
+ "163605": {
84
+ "content": "<|media_pad|>",
85
+ "lstrip": false,
86
+ "normalized": false,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": true
90
+ },
91
+ "163838": {
92
+ "content": "[PAD]",
93
+ "lstrip": false,
94
+ "normalized": false,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": true
98
+ },
99
+ "163839": {
100
+ "content": "[UNK]",
101
+ "lstrip": false,
102
+ "normalized": false,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": true
106
+ }
107
+ },
108
+ "additional_special_tokens": [
109
+ "<|im_end|>",
110
+ "<|im_user|>",
111
+ "<|im_assistant|>",
112
+ "<|im_system|>",
113
+ "<|im_middle|>",
114
+ "<|media_start|>",
115
+ "<|media_content|>",
116
+ "<|media_end|>",
117
+ "<|media_pad|>"
118
+ ],
119
+ "bos_token": "[BOS]",
120
+ "clean_up_tokenization_spaces": false,
121
+ "eos_token": "[EOS]",
122
+ "extra_special_tokens": {},
123
+ "model_max_length": 1048576,
124
+ "pad_token": "[PAD]",
125
+ "unk_token": "[UNK]",
126
+ "tokenizer_class": "TikTokenTokenizer",
127
+ "chat_template": "{%- for message in messages -%}{%- if loop.first and messages[0]['role'] != 'system' -%}{{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}{%- endif -%}{%- if message['role'] == 'system' -%}{{'<|im_system|>'}}{%- endif -%}{%- if message['role'] == 'user' -%}{{'<|im_user|>'}}{%- endif -%}{%- if message['role'] == 'assistant' -%}{{'<|im_assistant|>'}}{%- endif -%}{{- message['role'] -}}{{'<|im_middle|>'}}{%- if message['content'] is string -%}{{- message['content'] + '<|im_end|>' -}}{%- else -%}{%- for content in message['content'] -%}{%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}{{'<|media_start|>image<|media_content|><|media_pad|><|media_end|>'}}{%- else -%}{{content['text']}}{%- endif -%}{%- endfor -%}{{'<|im_end|>'}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{'<|im_assistant|>assistant<|im_middle|>'}}{%- endif -%}",
128
+ "auto_map": {
129
+ "AutoTokenizer": [
130
+ "tokenization_moonshot.TikTokenTokenizer",
131
+ null
132
+ ]
133
+ }
134
+ }