liushaowei committed
Commit 7f98307 · 1 Parent(s): 2bfbc7b
update readme
README.md
CHANGED

@@ -43,6 +43,11 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 - **Kimi-K2-Base**: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
 - **Kimi-K2-Instruct**: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
 
+<div align="center">
+  <picture>
+      <img src="figures/banner.png" width="80%" alt="Evaluation Results">
+  </picture>
+</div>
 
 ## 2. Model Summary
 
@@ -128,7 +133,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 
 <tr>
     <td align="center">SWE-bench Verified <br/><sup>(Agentless Coding)</sup></td>
-    <td align="center">Single Patch</td>
+    <td align="center">Single Patch w/o Test (Acc)</td>
     <td align="center"><ins><strong>51.8</strong></ins></td>
     <td align="center">36.6</td>
     <td align="center">39.4</td>
@@ -188,7 +193,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 
 <tr>
     <!--<td align="center">TerminalBench</td>-->
-    <td align="center">Acc</td>
+    <td align="center">Terminus (Acc)</td>
     <td align="center"><ins><strong>25.0</strong></ins></td>
     <td align="center">16.3</td>
     <td align="center">6.6</td>
@@ -495,26 +500,150 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 
 <div align="center">
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+<table>
+<thead>
+<tr>
+    <th align="center">Benchmark</th>
+    <th align="center">Metric</th>
+    <th align="center">Shot</th>
+    <th align="center">Kimi K2 Base</th>
+    <th align="center">Deepseek-V3-Base</th>
+    <th align="center">Qwen2.5-72B</th>
+    <th align="center">Llama 4 Maverick</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+    <td align="center" colspan="7"><strong>General Tasks</strong></td>
+</tr>
+<tr>
+    <td align="center">MMLU</td>
+    <td align="center">EM</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>87.8</strong></td>
+    <td align="center">87.1</td>
+    <td align="center">86.1</td>
+    <td align="center">84.9</td>
+</tr>
+<tr>
+    <td align="center">MMLU-pro</td>
+    <td align="center">EM</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>69.2</strong></td>
+    <td align="center">60.6</td>
+    <td align="center">62.8</td>
+    <td align="center">63.5</td>
+</tr>
+<tr>
+    <td align="center">MMLU-redux-2.0</td>
+    <td align="center">EM</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>90.2</strong></td>
+    <td align="center">89.5</td>
+    <td align="center">87.8</td>
+    <td align="center">88.2</td>
+</tr>
+<tr>
+    <td align="center">SimpleQA</td>
+    <td align="center">Correct</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>35.3</strong></td>
+    <td align="center">26.5</td>
+    <td align="center">10.3</td>
+    <td align="center">23.7</td>
+</tr>
+<tr>
+    <td align="center">TriviaQA</td>
+    <td align="center">EM</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>85.1</strong></td>
+    <td align="center">84.1</td>
+    <td align="center">76.0</td>
+    <td align="center">79.3</td>
+</tr>
+<tr>
+    <td align="center">GPQA-Diamond</td>
+    <td align="center">Avg@8</td>
+    <td align="center">5-shot</td>
+    <td align="center">48.1</td>
+    <td align="center"><strong>50.5</strong></td>
+    <td align="center">40.8</td>
+    <td align="center">49.4</td>
+</tr>
+<tr>
+    <td align="center">SuperGPQA</td>
+    <td align="center">EM</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>44.7</strong></td>
+    <td align="center">39.2</td>
+    <td align="center">34.2</td>
+    <td align="center">38.8</td>
+</tr>
+<tr>
+    <td align="center" colspan="7"><strong>Coding Tasks</strong></td>
+</tr>
+<tr>
+    <td align="center">LiveCodeBench v6</td>
+    <td align="center">Pass@1</td>
+    <td align="center">1-shot</td>
+    <td align="center"><strong>26.3</strong></td>
+    <td align="center">22.9</td>
+    <td align="center">21.1</td>
+    <td align="center">25.1</td>
+</tr>
+<tr>
+    <td align="center">EvalPlus</td>
+    <td align="center">Pass@1</td>
+    <td align="center">-</td>
+    <td align="center"><strong>80.3</strong></td>
+    <td align="center">65.6</td>
+    <td align="center">66.0</td>
+    <td align="center">65.5</td>
+</tr>
+<tr>
+    <td align="center" colspan="7"><strong>Mathematics Tasks</strong></td>
+</tr>
+<tr>
+    <td align="center">MATH</td>
+    <td align="center">EM</td>
+    <td align="center">4-shot</td>
+    <td align="center"><strong>70.2</strong></td>
+    <td align="center">60.1</td>
+    <td align="center">61.0</td>
+    <td align="center">63.0</td>
+</tr>
+<tr>
+    <td align="center">GSM8k</td>
+    <td align="center">EM</td>
+    <td align="center">8-shot</td>
+    <td align="center"><strong>92.1</strong></td>
+    <td align="center">91.7</td>
+    <td align="center">90.4</td>
+    <td align="center">86.3</td>
+</tr>
+<tr>
+    <td align="center" colspan="7"><strong>Chinese Tasks</strong></td>
+</tr>
+<tr>
+    <td align="center">C-Eval</td>
+    <td align="center">EM</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>92.5</strong></td>
+    <td align="center">90.0</td>
+    <td align="center">90.9</td>
+    <td align="center">80.9</td>
+</tr>
+<tr>
+    <td align="center">CSimpleQA</td>
+    <td align="center">Correct</td>
+    <td align="center">5-shot</td>
+    <td align="center"><strong>77.6</strong></td>
+    <td align="center">72.1</td>
+    <td align="center">50.5</td>
+    <td align="center">53.5</td>
+</tr>
+</tbody>
+</table>
 </div>
 <sup>
 • We only evaluate open-source pretrained models in this work. We report results for Qwen2.5-72B because the base checkpoint for Qwen3-235B-A22B was not open-sourced at the time of our study.
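
A note on the sampled metrics in the table above (Pass@1, Avg@8): the README does not spell out its estimation protocol, so as a point of reference only, here is a minimal sketch of the standard unbiased pass@k estimator from Chen et al. (2021). Treat it as an illustration, not the authors' method; Avg@8 may simply be the mean score over 8 samples.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of those samples that passed
    k: the k in pass@k
    """
    if n - c < k:
        # Fewer failures than k: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples per problem, 3 correct.
# For k=1 this reduces to the plain pass fraction c/n.
print(pass_at_k(n=8, c=3, k=1))  # 0.375
```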
@@ -656,4 +785,4 @@ Both the code repository and the model weights are released under the [Modified
 
 ## 7. Contact Us
 
-If you have any questions, please reach out at [[email protected]](mailto:[email protected]).
+If you have any questions, please reach out at [[email protected]](mailto:[email protected]).
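
To make the "drop-in" use described for Kimi-K2-Instruct concrete, here is a minimal chat-completion sketch against an OpenAI-compatible server (for example, a local vLLM or SGLang deployment). The `base_url`, API key, and model id below are placeholders and assumptions, not values taken from this page.

```python
from openai import OpenAI

# Placeholder endpoint and credentials: point these at whatever
# OpenAI-compatible server is actually serving the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Kimi-K2-Instruct",  # placeholder model id
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "Give me a one-line summary of mixture-of-experts models."},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```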