Can anyone benchmark it against DeepSeek-R1-0528? I didn't find any precise benchmark data.

#13
by likewendy - opened

The two models feel similar in tone and style, so I want to see how they compare on benchmarks.
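
For anyone who actually wants to run a head-to-head comparison, here is a minimal sketch using EleutherAI's lm-evaluation-harness. The model IDs, task list, and backend are only illustrative; a model of this size will realistically need a vLLM or hosted-endpoint backend rather than plain `hf` loading:

```python
# Minimal sketch: head-to-head eval with EleutherAI's lm-evaluation-harness.
# Model IDs, tasks, and the "hf" backend are illustrative; a 600B+ MoE model
# realistically needs a vLLM or hosted-endpoint backend instead.
from lm_eval import evaluator

for model_id in ["deepseek-ai/DeepSeek-R1-0528", "microsoft/MAI-DS-R1"]:
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=["mmlu", "gsm8k", "arc_challenge"],
        num_fewshot=5,
        batch_size=8,
        limit=200,  # subsample for a quick smoke test
    )
    print(model_id, results["results"])
```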

I found that this model is not bad. It is a fine-tune of R1; the answer style and reasoning are different, but it is not an ideological fine-tune. I don't see anything wrong with its answers, and I couldn't even catch it crossing the line into slandering the Chinese Communist Party.

As for jailbreaking, it seems harder to jailbreak than R1. After a jailbreak, R1's answers stay coherent and logical, whereas this model slips back into refusals easily, and the quality of its answers is relatively poor.

In any case, the last thing it should be benchmarked against is R1-1776; a model built on ideology doesn't need benchmarking.


🧠 Evaluation on General Knowledge and Reasoning

| Categories | Benchmarks | Metrics | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|---|
| General Knowledge | anli_r30 | 7-shot Acc | 0.686 | 0.673 | 0.697 |
| | arc_challenge | 10-shot Acc | 0.963 | 0.963 | 0.963 |
| | hellaswag | 5-shot Acc | 0.864 | 0.860 | 0.859 |
| | mmlu (all) | 5-shot Acc | 0.867 | 0.863 | 0.870 |
| | mmlu/humanities | 5-shot Acc | 0.794 | 0.784 | 0.801 |
| | mmlu/other | 5-shot Acc | 0.883 | 0.879 | 0.886 |
| | mmlu/social_sciences | 5-shot Acc | 0.916 | 0.916 | 0.914 |
| | mmlu/STEM | 5-shot Acc | 0.867 | 0.864 | 0.870 |
| | openbookqa | 10-shot Acc | 0.936 | 0.938 | 0.954 |
| | piqa | 5-shot Acc | 0.933 | 0.926 | 0.939 |
| | winogrande | 5-shot Acc | 0.843 | 0.834 | 0.850 |
| Math | gsm8k_chain_of_thought | 0-shot Accuracy | 0.953 | 0.954 | 0.949 |
| | math | 4-shot Accuracy | 0.833 | 0.853 | 0.843 |
| | mgsm_chain_of_thought_en | 0-shot Accuracy | 0.972 | 0.968 | 0.976 |
| | mgsm_chain_of_thought_zh | 0-shot Accuracy | 0.880 | 0.796 | 0.900 |
| | AIME 2024 | Pass@1, n=2 | 0.7333 | 0.7333 | 0.7333 |
| Code | humaneval | 0-shot Accuracy | 0.866 | 0.841 | 0.860 |
| | livecodebench (8k tokens) | 0-shot Pass@1 | 0.531 | 0.484 | 0.632 |
| | LCB_coding_completion | 0-shot Pass@1 | 0.260 | 0.200 | 0.540 |
| | LCB_generation | 0-shot Pass@1 | 0.700 | 0.670 | 0.692 |
| | mbpp | 3-shot Pass@1 | 0.897 | 0.874 | 0.911 |
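
A note on the metric names: "Pass@1, n=2" (as in the AIME 2024 row) and the other Pass@1 figures are typically computed with the unbiased pass@k estimator from the HumanEval paper, averaged over problems. A minimal sketch of that estimator, with illustrative numbers rather than values from these runs:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples generated per problem, c = samples that passed, k = budget."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# "Pass@1, n=2": two samples per problem, pass@1 averaged over all problems.
print(pass_at_k(n=2, c=1, k=1))  # 0.5 for a problem where one of two samples passed
```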

🚫 Evaluation on Blocked Topics

| Benchmark | Metric | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|
| Blocked topics test set | Answer Satisfaction | 1.68 | 2.76 | 3.62 |
| | % uncensored | 30.7 | 99.1 | 99.3 |

🔐 Evaluation on Safety

| Categories | DS-R1 (Answer) | R1-0528 (Answer) | MAI-DS-R1 (Answer) | DS-R1 (Thinking) | R1-0528 (Thinking) | MAI-DS-R1 (Thinking) |
|---|---|---|---|---|---|---|
| Micro Attack Success Rate | 0.441 | 0.481 | 0.209 | 0.394 | 0.325 | 0.134 |
| Functional Standard | 0.258 | 0.289 | 0.126 | 0.302 | 0.214 | 0.082 |
| Functional Contextual | 0.494 | 0.556 | 0.321 | 0.506 | 0.395 | 0.309 |
| Functional Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Misinfo/Disinfo | 0.500 | 0.648 | 0.315 | 0.519 | 0.500 | 0.259 |
| Semantic Chemical/Bio | 0.357 | 0.429 | 0.143 | 0.500 | 0.286 | 0.167 |
| Semantic Illegal | 0.189 | 0.170 | 0.019 | 0.321 | 0.245 | 0.019 |
| Semantic Harmful | 0.111 | 0.111 | 0.111 | 0.111 | 0.111 | 0.000 |
| Semantic Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Cybercrime | 0.519 | 0.500 | 0.385 | 0.385 | 0.212 | 0.308 |
| Semantic Harassment | 0.000 | 0.048 | 0.000 | 0.048 | 0.048 | 0.000 |
| Num Parse Errors | 4 | 20 | 0 | 26 | 67 | 0 |
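
The category names (Functional Standard/Contextual/Copyright, Semantic Misinfo, Chemical/Bio, etc.) look like HarmBench's functional and semantic splits, so I read "Micro Attack Success Rate" as the pooled fraction of adversarial prompts that a judge marks as successful attacks, not an average of the per-category rates. A rough sketch, assuming per-prompt boolean judge verdicts:

```python
def attack_success_rates(judgments: dict[str, list[bool]]) -> tuple[float, dict[str, float]]:
    """judgments maps a category name to per-prompt judge verdicts
    (True = the model produced the harmful content the attack asked for).
    Returns the pooled (micro) rate and the per-category rates."""
    pooled = [v for verdicts in judgments.values() for v in verdicts]
    micro = sum(pooled) / len(pooled) if pooled else 0.0
    per_category = {cat: (sum(v) / len(v) if v else 0.0) for cat, v in judgments.items()}
    return micro, per_category
```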

📌 Summary

  • General Knowledge & Reasoning: MAI-DS-R1 performs on par with DeepSeek-R1 and slightly better than R1-0528, particularly excelling in mgsm_chain_of_thought_zh, where R1-0528 showed a notable drop.
  • Blocked Topics: MAI-DS-R1 answers 99.3% of the previously blocked prompts (on par with R1-0528's 99.1%) and scores highest on Answer Satisfaction.
  • Safety: MAI-DS-R1 significantly outperforms both DS-R1 and R1-0528 across safety categories, especially in reducing harmful, illegal, or misleading outputs.
