teowu committed
Commit a7044da · verified · 1 Parent(s): c2e910a

Update README.md

Files changed (1)
  1. README.md +18 -17
README.md CHANGED
@@ -37,36 +37,37 @@ This is an updated version of [Kimi-VL-A3B-Thinking](https://huggingface.co/moon
 
 ## 2. Performance
 
-Comparison with efficient models and two previous versions of Kimi-VL:
+Comparison with efficient models and two previous versions of Kimi-VL (*Results of GPT-4o are for reference only and shown in <i>italics</i>*):
 
 <div align="center">
 
 | Benchmark (Metric) | GPT-4o | Qwen2.5-VL-7B | Gemma3-12B-IT | Kimi-VL-A3B-Instruct | Kimi-VL-A3B-Thinking | Kimi-VL-A3B-Thinking-2506 |
 |----------------------------|--------|---------------|---------------|----------------------|----------------------|--------------------------|
 | **General Multimodal** | | | | | | |
-| MMBench-EN-v1.1 (Acc) | 83.1 | 83.2 | 74.6 | 82.9 | 76.0 | **84.4** |
-| RealWorldQA (Acc) | 75.4 | 68.5 | 59.1 | 68.1 | 64.0 | **70.0** |
-| OCRBench (Acc) | 815 | 864 | 702 | 864 | 864 | **869** |
-| MMStar (Acc) | 64.7 | 63.0 | 56.1 | 61.7 | 64.2 | **70.4** |
-| MMVet (Acc) | 69.1 | 67.1 | 64.9 | 66.7 | 69.5 | **78.1** |
+| MMBench-EN-v1.1 (Acc) | *83.1* | 83.2 | 74.6 | 82.9 | 76.0 | **84.4** |
+| RealWorldQA (Acc) | *75.4* | 68.5 | 59.1 | 68.1 | 64.0 | **70.0** |
+| OCRBench (Acc) | *815* | 864 | 702 | 864 | 864 | **869** |
+| MMStar (Acc) | *64.7* | 63.0 | 56.1 | 61.7 | 64.2 | **70.4** |
+| MMVet (Acc) | *69.1* | 67.1 | 64.9 | 66.7 | 69.5 | **78.1** |
 | **Reasoning** | | | | | | |
-| MMMU (val, Pass@1) | 69.1 | 58.6 | 59.6 | 57.0 | 61.7 | **64.0** |
-| MMMU-Pro (Pass@1) | 51.7 | 38.1 | 32.1 | 36.0 | 43.2 | **46.3** |
+| MMMU (val, Pass@1) | *69.1* | 58.6 | 59.6 | 57.0 | 61.7 | **64.0** |
+| MMMU-Pro (Pass@1) | *51.7* | 38.1 | 32.1 | 36.0 | 43.2 | **46.3** |
 | **Math** | | | | | | |
-| MATH-Vision (Pass@1) | 30.4 | 25.0 | 32.1 | 21.7 | 36.8 | **56.9** |
-| MathVista_MINI (Pass@1) | 63.8 | 68.0 | 56.1 | 68.6 | 71.7 | **80.1** |
+| MATH-Vision (Pass@1) | *30.4* | 25.0 | 32.1 | 21.7 | 36.8 | **56.9** |
+| MathVista_MINI (Pass@1) | *63.8* | 68.0 | 56.1 | 68.6 | 71.7 | **80.1** |
 | **Video** | | | | | | |
-| VideoMMMU (Pass@1) | 61.2 | 47.4 | 57.0 | 52.1 | 55.5 | **65.2** |
-| MMVU (Pass@1) | 67.4 | 50.1 | 57.0 | 52.7 | 53.0 | **57.5** |
-| Video-MME (w/ sub.) | 77.2 | 71.6 | 62.1 | **72.7** | 66.0 | 71.9 |
+| VideoMMMU (Pass@1) | *61.2* | 47.4 | 57.0 | 52.1 | 55.5 | **65.2** |
+| MMVU (Pass@1) | *67.4* | 50.1 | 57.0 | 52.7 | 53.0 | **57.5** |
+| Video-MME (w/ sub.) | *77.2* | 71.6 | 62.1 | **72.7** | 66.0 | 71.9 |
 | **Agent Grounding** | | | | | | |
-| ScreenSpot-Pro (Acc) | 0.8 | 29.0 | — | 35.4 | — | **52.8** |
-| ScreenSpot-V2 (Acc) | 18.1 | 84.2 | — | **92.8** | — | 91.4 |
-| OSWorld-G (Acc) | - | 31.5 | — | 41.6 | — | **52.5** |
+| ScreenSpot-Pro (Acc) | *0.8* | 29.0 | — | 35.4 | — | **52.8** |
+| ScreenSpot-V2 (Acc) | *18.1* | 84.2 | — | **92.8** | — | 91.4 |
+| OSWorld-G (Acc) | - | *31.5* | — | 41.6 | — | **52.5** |
 | **Long Document** | | | | | | |
-| MMLongBench-DOC (Acc) | 42.8 | 29.6 | 21.3 | 35.1 | 32.5 | **42.1** |
+| MMLongBench-DOC (Acc) | *42.8* | 29.6 | 21.3 | 35.1 | 32.5 | **42.1** |
 </div>
 
+
 Comparison with 30B-70B open-source models:
 
 <div align="center">