AlpachinoNLP/Baichuan-7B-Instruction

@@ -103,6 +103,20 @@ model = model.quantize(4).cuda()
 ## [CMMLU](https://github.com/haonan-li/CMMLU)
 | Model zero-shot                                              |   STEM    | Humanities | Social Sciences |  Others   | China Specific |  Average  |
 | ------------------------------------------------------------ | :-------: | :--------: | :-------------: | :-------: | :------------: | :-------: |
 | [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b)      |   41.28   |   52.85    |      53.37      |   52.24   |     50.58      |   49.95   |
@@ -114,19 +128,18 @@ model = model.quantize(4).cuda()
 | [Chinese-GLM-10B](https://github.com/THUDM/GLM)              |   25.57   |   25.01    |      26.33      |   25.94   |     25.81      |   25.80   |
 | [Baichuan-13B](https://github.com/baichuan-inc/Baichuan-7B)  |   42.04   |   60.49    |      59.55      |   56.60   |     55.72      |   54.63   |
 | [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-7B) |   37.32   |   56.24    |      54.79      |   54.07   |     52.23      |   50.48   |
-| **Baichuan-13B-Instruction**                                 | **42.56** | **62.09**  |    **60.41**    | **58.97** |   **56.95**    | **55.88** |
 | **Baichuan-7B-Instruction**                                  | **33.94** | **46.31**  |    **47.73**    | **45.84** |   **44.88**    | **43.53** |
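 > Note: CMMLU is a comprehensive Chinese evaluation benchmark designed to assess the knowledge and reasoning abilities of language models in a Chinese-language context. We evaluated the models directly with its official [evaluation script](https://github.com/haonan-li/CMMLU). In the Model zero-shot table, the score for [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-13B) comes from our own run of the official CMMLU evaluation script; the scores for the other models are taken from the official [CMMLU](https://github.com/haonan-li/CMMLU/tree/master) results.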
-### English Leaderboard
-In addition to Chinese, we also tested the model's performance in English.
 #### MMLU
-[MMLU](https://arxiv.org/abs/2009.03300) is an English evaluation dataset that includes 57 multiple-choice tasks, covering elementary mathematics, American history, computer science, law, etc. The difficulty ranges from high school level to expert level, making it a mainstream LLM evaluation dataset.
-We adopted the [open-source]((https://github.com/hendrycks/test)) evaluation scheme, and the final 5-shot results are as follows:
 | Model                                  | Humanities | Social Sciences | STEM | Other | Average |
 |----------------------------------------|-----------:|:---------------:|:----:|:-----:|:-------:|
@@ -139,4 +152,8 @@ We adopted the [open-source]((https://github.com/hendrycks/test)) evaluation sch
 | moss-moon-003-base (16B)<sup>0</sup>   |       24.2 |      22.8       | 22.4 | 24.4  |  23.6   |
 | moss-moon-003-sft (16B)<sup>0</sup>    |       30.5 |      33.8       | 29.3 | 34.4  |  31.9   |
 | Baichuan-7B<sup>0</sup>                |       38.4 |      48.9       | 35.6 | 48.1  |  42.3   |
-| **Baichuan-7B<sup>0</sup>**            |       **38.9** |      **49.0**       | **35.3** | **48.8**  |  **42.6**   |

 ## [CMMLU](https://github.com/haonan-li/CMMLU)
+| Model 5-shot                                               |   STEM    | Humanities | Social Sciences |  Others  | China Specific | Average  |
+| ---------------------------------------------------------- | :-------: | :--------: | :-------------: | :------: | :------------: | :------: |
+| Baichuan-7B                                                |   34.4    |    47.5    |      47.6       |   46.6   |      44.3      |   44.0   |
+| Vicuna-13B                                                 |   31.8    |    36.2    |      37.6       |   39.5   |      34.3      |   36.3   |
+| Chinese-Alpaca-Plus-13B                                    |   29.8    |    33.4    |      33.2       |   37.9   |      32.1      |   33.4   |
+| Chinese-LLaMA-Plus-13B                                     |   28.1    |    33.1    |      35.4       |   35.1   |      33.5      |   33.0   |
+| Ziya-LLaMA-13B-Pretrain                                    |   29.0    |    30.7    |      33.8       |   34.4   |      31.9      |   32.1   |
+| LLaMA-13B                                                  |   29.2    |    30.8    |      31.6       |   33.0   |      30.5      |   31.2   |
+| moss-moon-003-base (16B)                                   |   27.2    |    30.4    |      28.8       |   32.6   |      28.7      |   29.6   |
+| Baichuan-13B-Base                                          |   41.7    |    61.1    |      59.8       |   59.0   |      56.4      |   55.3   |
+| Baichuan-13B-Chat                                          |   42.8    |    62.6    |      59.7       |   59.0   |      56.1      |   55.8   |
+| Baichuan-13B-Instruction                                   |   44.50   |   61.16    |      59.07      |  58.34   |     55.55      |  55.61   |
+| **Baichuan-7B-Instruction**                                  | **34.68** | **47.38**  |    **47.13**    | **45.11** |   **44.51**    | **43.57** |
 | Model zero-shot                                              |   STEM    | Humanities | Social Sciences |  Others   | China Specific |  Average  |
 | ------------------------------------------------------------ | :-------: | :--------: | :-------------: | :-------: | :------------: | :-------: |
 | [ChatGLM2-6B](https://huggingface.co/THUDM/chatglm2-6b)      |   41.28   |   52.85    |      53.37      |   52.24   |     50.58      |   49.95   |
 | [Chinese-GLM-10B](https://github.com/THUDM/GLM)              |   25.57   |   25.01    |      26.33      |   25.94   |     25.81      |   25.80   |
 | [Baichuan-13B](https://github.com/baichuan-inc/Baichuan-7B)  |   42.04   |   60.49    |      59.55      |   56.60   |     55.72      |   54.63   |
 | [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-7B) |   37.32   |   56.24    |      54.79      |   54.07   |     52.23      |   50.48   |
+| Baichuan-13B-Instruction                                     |   42.56   |   62.09    |      60.41      |   58.97   |     56.95      |   55.88   |
 | **Baichuan-7B-Instruction**                                  | **33.94** | **46.31**  |    **47.73**    | **45.84** |   **44.88**    | **43.53** |
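 > Note: CMMLU is a comprehensive Chinese evaluation benchmark designed to assess the knowledge and reasoning abilities of language models in a Chinese-language context. We evaluated the models directly with its official [evaluation script](https://github.com/haonan-li/CMMLU). In the Model zero-shot table, the score for [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-13B) comes from our own run of the official CMMLU evaluation script; the scores for the other models are taken from the official [CMMLU](https://github.com/haonan-li/CMMLU/tree/master) results.
+### English Capability Evaluation
+In addition to the Chinese leaderboard, we also tested the model on the English MMLU benchmark.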
 #### MMLU
+[MMLU](https://arxiv.org/abs/2009.03300) is an English evaluation dataset comprising 57 tasks.
+We adopted the open-source [evaluation scheme](https://github.com/hendrycks/test); the results are as follows:
 | Model                                  | Humanities | Social Sciences | STEM | Other | Average |
 |----------------------------------------|-----------:|:---------------:|:----:|:-----:|:-------:|
 | moss-moon-003-base (16B)<sup>0</sup>   |       24.2 |      22.8       | 22.4 | 24.4  |  23.6   |
 | moss-moon-003-sft (16B)<sup>0</sup>    |       30.5 |      33.8       | 29.3 | 34.4  |  31.9   |
 | Baichuan-7B<sup>0</sup>                |       38.4 |      48.9       | 35.6 | 48.1  |  42.3   |
+| **Baichuan-7B-Instruction(5-shot)**            |       **38.9** |      **49.0**       | **35.3** | **48.8**  |  **42.6**   |
+| **Baichuan-7B-Instruction(0-shot)**            |       **38.7** |      **47.9**       | **34.5** | **48.2**  |  **42.0**   |
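+> Note: CMMLU is a comprehensive Chinese evaluation benchmark designed to assess the knowledge and reasoning abilities of language models in a Chinese-language context. We evaluated the models directly with its official [evaluation script](https://github.com/haonan-li/CMMLU). In the Model zero-shot table, the score for [Baichuan-13B-Chat](https://github.com/baichuan-inc/Baichuan-13B) comes from our own run of the official CMMLU evaluation script; the scores for the other models are taken from the official [CMMLU](https://github.com/haonan-li/CMMLU/tree/master) results.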
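
The 5-shot and 0-shot rows in the tables above differ only in how many solved examples are placed in front of each test question. The following is a minimal sketch of that idea, assuming a prompt layout similar to the one used by open-source multiple-choice evaluation scripts; it is not the script used to produce the scores here. The names `format_example` and `build_prompt` are hypothetical, the header wording is an approximation, and the sample questions are invented placeholders rather than MMLU or CMMLU data.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one multiple-choice item.

    When `answer` is None the trailing "Answer:" is left open, which is
    the form used for the item the model is asked to solve.
    """
    lines = [question]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_prompt(subject, dev_examples, test_item, k):
    """Build a k-shot prompt: a header, k solved examples, then the test item.

    `dev_examples` is a list of (question, options, answer) tuples and
    `test_item` is a (question, options) tuple. With k=0 no solved
    examples are included (zero-shot).
    """
    # Assumed header wording; real evaluation scripts may phrase it differently.
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}.\n\n"
    )
    shots = [format_example(q, opts, ans) for q, opts, ans in dev_examples[:k]]
    question, options = test_item
    return header + "\n\n".join(shots + [format_example(question, options)])


# Invented placeholder items, for illustration only.
dev = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    ("3 * 3 = ?", ["6", "9", "12", "3"], "B"),
]
test_item = ("5 - 1 = ?", ["4", "1", "6", "5"])

print(build_prompt("arithmetic", dev, test_item, k=0))
print("-----")
print(build_prompt("arithmetic", dev, test_item, k=2))
```

In a setup like this, the model's prediction is typically read from whichever of the labels A to D it scores highest after the final open `Answer:`; how that comparison is done depends on the evaluation script in use.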