lsw825 committed on
Commit b51274b · verified · 1 Parent(s): 8a112f5

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +50 -61
README.md CHANGED
@@ -1,17 +1,11 @@
1
- <!-- ---
2
- library_name: transformers
3
- --- -->
4
- <!-- markdownlint-disable first-line-h1 -->
5
- <!-- markdownlint-disable html -->
6
- <!-- markdownlint-disable no-duplicate-header -->
7
-
8
-
9
  <div align="center">
10
  <picture>
11
  <img src="figures/kimi-logo.png" width="30%" alt="Kimi K2: Open Agentic Intelligence">
12
  </picture>
13
  </div>
 
14
  <hr>
 
15
  <div align="center" style="line-height:1">
16
  <a href="https://www.kimi.com" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-Kimi%20K2-ff6b6b?color=1783ff&logoColor=white"/></a>
17
  <a href="https://www.moonshot.ai" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-Moonshot%20AI-white?logo=Kimi&logoColor=white"/></a>
@@ -31,8 +25,6 @@ library_name: transformers
31
  <b>📰&nbsp;&nbsp;<a href="https://moonshotai.github.io/Kimi-K2/">Tech Blog</a></b> &nbsp;&nbsp;&nbsp; | &nbsp;&nbsp;&nbsp; <b>📄&nbsp;&nbsp;Paper Link (coming soon)</b>
32
  </p>
33
 
34
- ## 0. Reminder: Remove this after you squash the commit history before release.
35
-
36
  ## 1. Model Introduction
37
 
38
  Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.
@@ -47,11 +39,6 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
47
  - **Kimi-K2-Instruct**: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
48
 
49
 
50
- <p align="center">
51
- TODO this is a banner
52
- <img width="80%" src="figures/logo.svg">
53
- </p>
54
-
55
  ## 2. Model Summary
56
 
57
  <div align="center">
@@ -86,13 +73,13 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
86
  <tr>
87
  <th align="center">Benchmark</th>
88
  <th align="center">Metric</th>
89
- <th align="center">Kimi K2 Instruct</th>
90
- <th align="center">DeepSeek-V3-0324</th>
91
- <th align="center">Qwen3-235B-A22B <br><sup>(non-thinking)</sup></th>
92
- <th align="center">Claude Sonnet 4 <br><sup>(w/o extended thinking)</sup></th>
93
- <th align="center">Claude Opus 4 <br><sup>(w/o extended thinking)</sup></th>
94
- <th align="center">GPT-4.1</th>
95
- <th align="center">Gemini 2.5 Flash <br> Preview (05-20)</th>
96
  </tr>
97
  </thead>
98
  <tbody>
@@ -106,7 +93,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
106
  <td align="center">46.9</td>
107
  <td align="center">37.0</td>
108
  <td align="center">48.5</td>
109
- <td align="center">47.4</t6
110
  <td align="center">44.7</td>
111
  <td align="center">44.7</td>
112
  </tr>
@@ -121,10 +108,11 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
121
  <td align="center">19.5</td>
122
  <td align="center">19.5</td>
123
  </tr>
 
124
  <tr>
125
  <td align="center">MultiPL-E</td>
126
  <td align="center">Pass@1</td>
127
- <td align="center"><ins><strong>86.7</strong></ins></td>
128
  <td align="center">83.1</td>
129
  <td align="center">78.2</td>
130
  <td align="center">88.6</td>
@@ -132,6 +120,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
132
  <td align="center">86.7</td>
133
  <td align="center">85.6</td>
134
  </tr>
 
135
  <tr>
136
  <td align="center">SWE-bench Verified <br/><sup>(Agentless Coding)</sup></td>
137
  <td align="center">Single Patch</td>
@@ -143,18 +132,19 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
143
  <td align="center">40.8</td>
144
  <td align="center">32.6</td>
145
  </tr>
146
-
147
  <tr>
148
  <td align="center" rowspan="2">SWE-bench Verified <br/> <sup>(Agentic Coding)</sup></td>
149
  <td align="center">Single Attempt (Acc)</td>
150
  <td align="center"><ins><strong>65.8</strong></ins></td>
151
  <td align="center">38.8</td>
152
  <td align="center">34.4</td>
153
- <td align="center"><strong>72.7</strong></td>
154
  <td align="center">72.5<sup>*</sup></td>
155
  <td align="center">54.6</td>
156
  <td align="center">—</td>
157
  </tr>
 
158
  <tr>
159
  <!--<td align="center">(Agentic Coding)</td>-->
160
  <td align="center">Multiple Attempts (Acc)</td>
@@ -168,7 +158,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
168
  </tr>
169
 
170
  <tr>
171
- <td align="center" rowspan="2">SWE-bench Multilingual<br /> <sup>(Agentic Coding)</sup></td>
172
  <td align="center">Single Attempt (Acc)</td>
173
  <td align="center"><ins><strong>47.3</strong> </ins></td>
174
  <td align="center">25.8</td>
@@ -178,23 +168,25 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
178
  <td align="center">31.5</td>
179
  <td align="center">—</td>
180
  </tr>
 
181
  <tr>
182
- <!--<td align="center">(Agentic Coding)</td>-->
183
  <td align="center">Inhouse Framework (Acc)</td>
184
- <td align="center"><ins><strong>30.0</strong> </ins></td>
185
  <td align="center">—</td>
186
  <td align="center">—</td>
187
  <td align="center">35.5</td>
188
  <td align="center"><strong>43.2</strong></td>
189
- <td align="center">8.30</td>
190
  <td align="center">—</td>
191
  </tr>
 
192
  <tr>
193
- <td align="center">TerminalBench</td>
194
  <td align="center">Acc</td>
195
  <td align="center"><ins><strong>25.0</strong> </ins></td>
196
  <td align="center">16.3</td>
197
- <td align="center">6.60</td>
198
  <td align="center">—</td>
199
  <td align="center">—</td>
200
  <td align="center"><strong>30.3</strong></td>
@@ -291,7 +283,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
291
  <td align="center">91.2<sup>*</sup></td>
292
  <td align="center">94.0</td>
293
  <td align="center">94.4</td>
294
- <td align="center">92.4/td>
295
  <td align="center">95.4</td>
296
  </tr>
297
  <tr>
@@ -301,7 +293,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
301
  <td align="center">27.5</td>
302
  <td align="center">11.9</td>
303
  <td align="center">15.9</td>
304
- <td align="center">15.8</td>
305
  <td align="center">19.4</td>
306
  <td align="center">34.7</td>
307
  </tr>
@@ -333,7 +325,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
333
  <td align="center">Acc</td>
334
  <td align="center"><strong>89.0</strong></td>
335
  <td align="center">84.0</td>
336
- <td align="center">37.7</td>
337
  <td align="center">73.7</td>
338
  <td align="center">59.3</td>
339
  <td align="center">58.5</td>
@@ -377,8 +369,8 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
377
  </tr>
378
 
379
  <tr>
380
- <td align="center">Humanity’s Last</td>
381
- <td align="center">(Text Only)</td>
382
  <td align="center">4.7</td>
383
  <td align="center">5.2</td>
384
  <td align="center"><ins><strong>5.7</strong></ins></td>
@@ -491,7 +483,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
491
  </sup><br/><sup>
492
  • Some data points have been omitted due to prohibitively expensive evaluation costs.
493
  </sup>
494
-
495
  ---
496
 
497
  #### Base model evaluation results
@@ -501,22 +493,22 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
501
  | Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
502
  |:-------------------:|:----------:|:---------:|:--------------:|:------------------:|:-------------:|:------------------:|
503
  | **General Tasks** | | | | | | |
504
- | MMLU | EM | 5-shot | **87.79** | 87.1 | 86.08 | 84.87 |
505
- | MMLU-pro | EM | 5-shot | **69.17** | 60.59 | 62.8 | 63.47 |
506
- | MMLU-redux-2.0 | EM | 5-shot | **90.17** | 89.53 | 87.77 | 88.18 |
507
- | SimpleQA | Correct | 5-shot | **35.25** | 26.49 | 10.31 | 23.74 |
508
- | TriviaQA | EM | 5-shot | **85.09** | 84.11 | 76.03 | 79.25 |
509
- | GPQA-Diamond | Avg@8 | 5-shot | 48.11 | **50.51** | 40.78 | 49.43 |
510
- | SuperGPQA | EM | 5-shot | **44.67** | 39.2 | 34.23 | 38.84 |
511
  | **Code Tasks** | | | | | | |
512
- | LiveCodeBench v6 | Pass@1 | 1-shot | **26.29** | 22.86 | 21.14 | 25.14 |
513
- | EvalPlus | Pass@1 | - | **80.33** | 65.61 | 66.04 | 65.48 |
514
  | **Mathematics Tasks** | | | | | | |
515
- | MATH | EM | 4-shot | **70.22** | 60.06 | 60.96 | 63.02 |
516
- | GSM8k | EM | 8-shot | **92.12** | 91.66 | 90.37 | 86.35 |
517
  | **Chinese Tasks** | | | | | | |
518
- | C-Eval | EM | 5-shot | **92.5** | 90.04 | 90.86 | 80.91 |
519
- | CSimpleQA | Correct | 5-shot | **77.57** | 72.13 | 50.53 | 53.47 |
520
 
521
  </div>
522
  <sup>
@@ -537,12 +529,12 @@ Our model checkpoints are stored in the block-fp8 format, you can find it on [Hu
537
 
538
  Currently, Kimi-K2 is recommended to run on the following inference engines:
539
 
540
- * vLLM
541
  * SGLang
542
  * KTransformers
543
- * TensorRT-LLM
544
 
545
- Deployment examples for vLLM and SGLang can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
546
 
547
  ---
548
 
@@ -568,7 +560,7 @@ def simple_chat(client: OpenAI, model_name: str):
568
  print(response.choices[0].message.content)
569
  ```
570
 
571
- > [!NOTE]
572
  > The recommended temperature for Kimi-K2-Instruct is `temperature = 0.6`.
573
  > If no special instructions are required, the system prompt above is a good default.
574
 
@@ -576,7 +568,7 @@ def simple_chat(client: OpenAI, model_name: str):
576
 
577
  ### Tool Calling
578
 
579
- Kimi-K2-Instruct has strong tool-calling capabilities.
580
  To enable them, you need to pass the list of available tools in each request, then the model will autonomously decide when and how to invoke them.
581
 
582
  The following example demonstrates calling a weather tool end-to-end:
@@ -645,8 +637,8 @@ def tool_call_with_client(client: OpenAI, model_name: str):
645
  print(choice.message.content)
646
  ```
647
 
648
- The `tool_call_with_client` function implements the pipeline from user query to tool execution.
649
- This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic.
650
  For streaming output and manual tool-parsing, see the [Tool Calling Guide](docs/tool_call_guidance.md).
651
 
652
  ---
@@ -655,9 +647,6 @@ For streaming output and manual tool-parsing, see the [Tool Calling Guide](docs/
655
 
656
  Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).
657
 
658
- In short, it is MIT License for most people, but you need to give credit to "Kimi K2" by displaying it prominently in your product, if you have more than 100 million monthly active users or annual revenue exceeding 20 million USD.
659
-
660
-
661
  ---
662
 
663
  ## 7. Contact Us
1
  <div align="center">
2
  <picture>
3
  <img src="figures/kimi-logo.png" width="30%" alt="Kimi K2: Open Agentic Intelligence">
4
  </picture>
5
  </div>
6
+
7
  <hr>
8
+
9
  <div align="center" style="line-height:1">
10
  <a href="https://www.kimi.com" target="_blank"><img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-Kimi%20K2-ff6b6b?color=1783ff&logoColor=white"/></a>
11
  <a href="https://www.moonshot.ai" target="_blank"><img alt="Homepage" src="https://img.shields.io/badge/Homepage-Moonshot%20AI-white?logo=Kimi&logoColor=white"/></a>
 
25
  <b>📰&nbsp;&nbsp;<a href="https://moonshotai.github.io/Kimi-K2/">Tech Blog</a></b> &nbsp;&nbsp;&nbsp; | &nbsp;&nbsp;&nbsp; <b>📄&nbsp;&nbsp;Paper Link (coming soon)</b>
26
  </p>
27
 
 
 
28
  ## 1. Model Introduction
29
 
30
  Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.
 
39
  - **Kimi-K2-Instruct**: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
40
 
41
 
 
 
 
 
 
42
  ## 2. Model Summary
43
 
44
  <div align="center">
 
73
  <tr>
74
  <th align="center">Benchmark</th>
75
  <th align="center">Metric</th>
76
+ <th align="center"><sup>Kimi K2 Instruct</sup></th>
77
+ <th align="center"><sup>DeepSeek-V3-0324</sup></th>
78
+ <th align="center"><sup>Qwen3-235B-A22B <br><sup>(non-thinking)</sup></sup></th>
79
+ <th align="center"><sup>Claude Sonnet 4 <br><sup>(w/o extended thinking)</sup></sup></th>
80
+ <th align="center"><sup>Claude Opus 4 <br><sup>(w/o extended thinking)</sup></sup></th>
81
+ <th align="center"><sup>GPT-4.1</sup></th>
82
+ <th align="center"><sup>Gemini 2.5 Flash <br> Preview (05-20)</sup></th>
83
  </tr>
84
  </thead>
85
  <tbody>
 
93
  <td align="center">46.9</td>
94
  <td align="center">37.0</td>
95
  <td align="center">48.5</td>
96
+ <td align="center">47.4</td>
97
  <td align="center">44.7</td>
98
  <td align="center">44.7</td>
99
  </tr>
 
108
  <td align="center">19.5</td>
109
  <td align="center">19.5</td>
110
  </tr>
111
+
112
  <tr>
113
  <td align="center">MultiPL-E</td>
114
  <td align="center">Pass@1</td>
115
+ <td align="center"><ins><strong>85.7</strong></ins></td>
116
  <td align="center">83.1</td>
117
  <td align="center">78.2</td>
118
  <td align="center">88.6</td>
 
120
  <td align="center">86.7</td>
121
  <td align="center">85.6</td>
122
  </tr>
123
+
124
  <tr>
125
  <td align="center">SWE-bench Verified <br/><sup>(Agentless Coding)</sup></td>
126
  <td align="center">Single Patch</td>
 
132
  <td align="center">40.8</td>
133
  <td align="center">32.6</td>
134
  </tr>
135
+
136
  <tr>
137
  <td align="center" rowspan="2">SWE-bench Verified <br/> <sup>(Agentic Coding)</sup></td>
138
  <td align="center">Single Attempt (Acc)</td>
139
  <td align="center"><ins><strong>65.8</strong></ins></td>
140
  <td align="center">38.8</td>
141
  <td align="center">34.4</td>
142
+ <td align="center"><strong>72.7</strong><sup>*</sup></td>
143
  <td align="center">72.5<sup>*</sup></td>
144
  <td align="center">54.6</td>
145
  <td align="center">—</td>
146
  </tr>
147
+
148
  <tr>
149
  <!--<td align="center">(Agentic Coding)</td>-->
150
  <td align="center">Multiple Attempts (Acc)</td>
 
158
  </tr>
159
 
160
  <tr>
161
+ <td align="center">SWE-bench Multilingual<br /> <sup>(Agentic Coding)</sup></td>
162
  <td align="center">Single Attempt (Acc)</td>
163
  <td align="center"><ins><strong>47.3</strong> </ins></td>
164
  <td align="center">25.8</td>
 
168
  <td align="center">31.5</td>
169
  <td align="center">—</td>
170
  </tr>
171
+
172
  <tr>
173
+ <td align="center" rowspan="2">TerminalBench</td>
174
  <td align="center">Inhouse Framework (Acc)</td>
175
+ <td align="center"><ins><strong>30.0</strong></ins></td>
176
  <td align="center">—</td>
177
  <td align="center">—</td>
178
  <td align="center">35.5</td>
179
  <td align="center"><strong>43.2</strong></td>
180
+ <td align="center">8.3</td>
181
  <td align="center">—</td>
182
  </tr>
183
+
184
  <tr>
185
+ <!--<td align="center">TerminalBench</td>-->
186
  <td align="center">Acc</td>
187
  <td align="center"><ins><strong>25.0</strong> </ins></td>
188
  <td align="center">16.3</td>
189
+ <td align="center">6.6</td>
190
  <td align="center">—</td>
191
  <td align="center">—</td>
192
  <td align="center"><strong>30.3</strong></td>
 
283
  <td align="center">91.2<sup>*</sup></td>
284
  <td align="center">94.0</td>
285
  <td align="center">94.4</td>
286
+ <td align="center">92.4</td>
287
  <td align="center">95.4</td>
288
  </tr>
289
  <tr>
 
293
  <td align="center">27.5</td>
294
  <td align="center">11.9</td>
295
  <td align="center">15.9</td>
296
+ <td align="center">15.9</td>
297
  <td align="center">19.4</td>
298
  <td align="center">34.7</td>
299
  </tr>
 
325
  <td align="center">Acc</td>
326
  <td align="center"><strong>89.0</strong></td>
327
  <td align="center">84.0</td>
328
+ <td align="center">37.7<sup>*</sup></td>
329
  <td align="center">73.7</td>
330
  <td align="center">59.3</td>
331
  <td align="center">58.5</td>
 
369
  </tr>
370
 
371
  <tr>
372
+ <td align="center">Humanity's Last Exam<br><sup>(Text Only)</sup></td>
373
+ <td align="center">-</td>
374
  <td align="center">4.7</td>
375
  <td align="center">5.2</td>
376
  <td align="center"><ins><strong>5.7</strong></ins></td>
 
483
  </sup><br/><sup>
484
  • Some data points have been omitted due to prohibitively expensive evaluation costs.
485
  </sup>
486
+
487
  ---
488
 
489
  #### Base model evaluation results
 
493
  | Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
494
  |:-------------------:|:----------:|:---------:|:--------------:|:------------------:|:-------------:|:------------------:|
495
  | **General Tasks** | | | | | | |
496
+ | MMLU | EM | 5-shot | **87.8** | 87.1 | 86.1 | 84.9 |
497
+ | MMLU-pro | EM | 5-shot | **69.2** | 60.6 | 62.8 | 63.5 |
498
+ | MMLU-redux-2.0 | EM | 5-shot | **90.2** | 89.5 | 87.8 | 88.2 |
499
+ | SimpleQA | Correct | 5-shot | **35.3** | 26.5 | 10.3 | 23.7 |
500
+ | TriviaQA | EM | 5-shot | **85.1** | 84.1 | 76.0 | 79.3 |
501
+ | GPQA-Diamond | Avg@8 | 5-shot | 48.1 | **50.5** | 40.8 | 49.4 |
502
+ | SuperGPQA | EM | 5-shot | **44.7** | 39.2 | 34.2 | 38.8 |
503
  | **Code Tasks** | | | | | | |
504
+ | LiveCodeBench v6 | Pass@1 | 1-shot | **26.3** | 22.9 | 21.1 | 25.1 |
505
+ | EvalPlus | Pass@1 | - | **80.3** | 65.6 | 66.0 | 65.5 |
506
  | **Mathematics Tasks** | | | | | | |
507
+ | MATH | EM | 4-shot | **70.2** | 60.1 | 61.0 | 63.0 |
508
+ | GSM8k | EM | 8-shot | **92.1** | 91.7 | 90.4 | 86.3 |
509
  | **Chinese Tasks** | | | | | | |
510
+ | C-Eval | EM | 5-shot | **92.5** | 90.0 | 90.9 | 80.9 |
511
+ | CSimpleQA | Correct | 5-shot | **77.6** | 72.1 | 50.5 | 53.5 |
512
 
513
  </div>
514
  <sup>
 
529
 
530
  Currently, Kimi-K2 is recommended to run on the following inference engines:
531
 
532
+ * vLLM
533
  * SGLang
534
  * KTransformers
535
+ * TensorRT-LLM
536
 
537
+ Deployment examples for vLLM and SGLang can be found in the [Model Deployment Guide](docs/deploy_guidance.md).
538
 
539
  ---
540
 
 
560
  print(response.choices[0].message.content)
561
  ```
562
 
563
+ > [!NOTE]
564
  > The recommended temperature for Kimi-K2-Instruct is `temperature = 0.6`.
565
  > If no special instructions are required, the system prompt above is a good default.
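A minimal sketch of a request using these defaults. The base URL, model id, and system-prompt text below are placeholders for your own deployment, not values from this repository:

```python
# Minimal sketch of a chat request for Kimi-K2-Instruct against an
# OpenAI-compatible endpoint. The request is assembled as a plain dict
# so the defaults are easy to inspect before sending.

def build_chat_request(user_msg: str, model_name: str = "kimi-k2-instruct") -> dict:
    """Assemble chat-completion parameters with the recommended defaults."""
    return {
        "model": model_name,  # placeholder: use the model id your server exposes
        "messages": [
            # Placeholder system prompt -- substitute the default from the
            # full example above.
            {"role": "system", "content": "You are Kimi, an AI assistant."},
            {"role": "user", "content": user_msg},
        ],
        # Recommended sampling temperature for Kimi-K2-Instruct.
        "temperature": 0.6,
    }

# Sending the request (requires a running OpenAI-compatible endpoint):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# response = client.chat.completions.create(**build_chat_request("Hello"))
# print(response.choices[0].message.content)
```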
566
 
 
568
 
569
  ### Tool Calling
570
 
571
+ Kimi-K2-Instruct has strong tool-calling capabilities.
572
  To enable them, you need to pass the list of available tools in each request, then the model will autonomously decide when and how to invoke them.
573
 
574
  The following example demonstrates calling a weather tool end-to-end:
 
637
  print(choice.message.content)
638
  ```
639
 
640
+ The `tool_call_with_client` function implements the pipeline from user query to tool execution.
641
+ This pipeline requires the inference engine to support Kimi-K2’s native tool-parsing logic.
642
  For streaming output and manual tool-parsing, see the [Tool Calling Guide](docs/tool_call_guidance.md).
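The tool round trip above can be sketched as follows. The weather tool, its schema, and the dict-shaped tool calls are illustrative assumptions (real client libraries return tool calls as objects, shown here as plain dicts), not the repository's exact example:

```python
import json

def get_weather(city: str) -> dict:
    # Stand-in implementation for demonstration only.
    return {"city": city, "weather": "sunny"}

# Tool schema passed in each request so the model knows what it may call.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

TOOL_IMPLS = {"get_weather": get_weather}

def run_tool_calls(tool_calls: list) -> list:
    """Execute each tool call the model requested and build the
    role="tool" messages to append before the follow-up request."""
    messages = []
    for call in tool_calls:
        fn = TOOL_IMPLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return messages
```

The loop continues until a response arrives with no tool calls, at which point the final message content is printed.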
643
 
644
  ---
 
647
 
648
  Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).
649
 
 
 
 
650
  ---
651
 
652
  ## 7. Contact Us