Can anyone benchmark it against DeepSeekR10528? I didn't find any precise benchmark data.
Because the two models feel similar in tone and style, I want to see how they compare on benchmarks.
I found that this model is not bad. It is a fine-tune of R1: the answer style and reasoning differ, but it is not an ideological fine-tune. I don't see anything wrong with its answers; I couldn't even find it crossing the line into slandering the Chinese Communist Party.
As for jailbreaking, it seems harder to jailbreak than R1. After a jailbreak, R1's answers stay coherent and logical, whereas this model easily falls back into refusing, and its jailbroken answers are also relatively poor.
So the last thing it needs is a benchmark against R1-1776; a model built on ideology isn't worth benchmarking.
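For anyone who wants to produce comparable numbers themselves, here is a minimal sketch using EleutherAI's lm-evaluation-harness Python API. The repo IDs are the public HF checkpoints; everything else (the vLLM backend, `tensor_parallel_size=8`) is an assumption about the serving setup, and at this parameter count you would realistically need a multi-GPU deployment:

```python
# Minimal sketch: 5-shot MMLU comparison via lm-evaluation-harness
# (pip install lm-eval). Hardware settings below are assumptions.
import lm_eval

for repo in ("microsoft/MAI-DS-R1", "deepseek-ai/DeepSeek-R1-0528"):
    results = lm_eval.simple_evaluate(
        model="vllm",                    # or "hf" for small-scale smoke tests
        model_args=f"pretrained={repo},tensor_parallel_size=8",
        tasks=["mmlu"],
        num_fewshot=5,                   # matches the 5-shot Acc column below
    )
    print(repo, results["results"]["mmlu"])
```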
likewendy
changed discussion status to
closed
likewendy
changed discussion status to
open
🧠 Evaluation on General Knowledge and Reasoning
| Categories | Benchmarks | Metrics | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|---|
| General Knowledge | anli_r30 | 7-shot Acc | 0.686 | 0.673 | 0.697 |
| | arc_challenge | 10-shot Acc | 0.963 | 0.963 | 0.963 |
| | hellaswag | 5-shot Acc | 0.864 | 0.860 | 0.859 |
| | mmlu (all) | 5-shot Acc | 0.867 | 0.863 | 0.870 |
| | mmlu/humanities | 5-shot Acc | 0.794 | 0.784 | 0.801 |
| | mmlu/other | 5-shot Acc | 0.883 | 0.879 | 0.886 |
| | mmlu/social_sciences | 5-shot Acc | 0.916 | 0.916 | 0.914 |
| | mmlu/STEM | 5-shot Acc | 0.867 | 0.864 | 0.870 |
| | openbookqa | 10-shot Acc | 0.936 | 0.938 | 0.954 |
| | piqa | 5-shot Acc | 0.933 | 0.926 | 0.939 |
| | winogrande | 5-shot Acc | 0.843 | 0.834 | 0.850 |
| Math | gsm8k_chain_of_thought | 0-shot Accuracy | 0.953 | 0.954 | 0.949 |
| | math | 4-shot Accuracy | 0.833 | 0.853 | 0.843 |
| | mgsm_chain_of_thought_en | 0-shot Accuracy | 0.972 | 0.968 | 0.976 |
| | mgsm_chain_of_thought_zh | 0-shot Accuracy | 0.880 | 0.796 | 0.900 |
| | AIME 2024 | Pass@1, n=2 | 0.7333 | 0.7333 | 0.7333 |
| Code | humaneval | 0-shot Accuracy | 0.866 | 0.841 | 0.860 |
| | livecodebench (8k tokens) | 0-shot Pass@1 | 0.531 | 0.484 | 0.632 |
| | LCB_coding_completion | 0-shot Pass@1 | 0.260 | 0.200 | 0.540 |
| | LCB_generation | 0-shot Pass@1 | 0.700 | 0.670 | 0.692 |
| | mbpp | 3-shot Pass@1 | 0.897 | 0.874 | 0.911 |
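A side note on the AIME row: Pass@1 with n=2 samples per problem is presumably computed with the standard unbiased pass@k estimator from the HumanEval paper (an assumption, since the table doesn't say). A quick sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples drawn, c = correct samples, k = budget.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer wrong samples than the budget: always passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n=2 samples per AIME problem:
print(pass_at_k(n=2, c=0, k=1))  # 0.0  (no sample correct)
print(pass_at_k(n=2, c=1, k=1))  # 0.5  (one of two correct)
print(pass_at_k(n=2, c=2, k=1))  # 1.0  (both correct)
# The reported score would be this value averaged over all problems.
```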
🚫 Evaluation on Blocked Topics
| Benchmark | Metric | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|
| Blocked topics test set | Answer Satisfaction (0–4) | 1.68 | 2.76 | 3.62 |
| | % uncensored | 30.7 | 99.1 | 99.3 |
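For intuition, "% uncensored" is the share of blocked-topic prompts that receive a substantive answer instead of a refusal. Below is a minimal sketch using a crude refusal-keyword heuristic; the actual evaluation almost certainly uses an LLM judge, so treat the marker list and function names as illustrative assumptions:

```python
# Crude refusal detector; real evaluations typically use an LLM judge.
REFUSAL_MARKERS = (
    "i cannot", "i can't", "i am unable", "i won't",
    "i'm sorry, but", "as an ai",
)

def is_refusal(answer: str) -> bool:
    head = answer.strip().lower()[:200]  # refusals usually open the reply
    return any(marker in head for marker in REFUSAL_MARKERS)

def pct_uncensored(answers: list[str]) -> float:
    """Percentage of prompts that received a non-refusal answer."""
    answered = sum(not is_refusal(a) for a in answers)
    return 100.0 * answered / len(answers)
```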
🔐 Evaluation on Safety
| Categories | DS-R1 (Answer) | R1-0528 (Answer) | MAI-DS-R1 (Answer) | DS-R1 (Thinking) | R1-0528 (Thinking) | MAI-DS-R1 (Thinking) |
|---|---|---|---|---|---|---|
| Micro Attack Success Rate | 0.441 | 0.481 | 0.209 | 0.394 | 0.325 | 0.134 |
| Functional Standard | 0.258 | 0.289 | 0.126 | 0.302 | 0.214 | 0.082 |
| Functional Contextual | 0.494 | 0.556 | 0.321 | 0.506 | 0.395 | 0.309 |
| Functional Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Misinfo/Disinfo | 0.500 | 0.648 | 0.315 | 0.519 | 0.500 | 0.259 |
| Semantic Chemical/Bio | 0.357 | 0.429 | 0.143 | 0.500 | 0.286 | 0.167 |
| Semantic Illegal | 0.189 | 0.170 | 0.019 | 0.321 | 0.245 | 0.019 |
| Semantic Harmful | 0.111 | 0.111 | 0.111 | 0.111 | 0.111 | 0.000 |
| Semantic Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Cybercrime | 0.519 | 0.500 | 0.385 | 0.385 | 0.212 | 0.308 |
| Semantic Harassment | 0.000 | 0.048 | 0.000 | 0.048 | 0.048 | 0.000 |
| Num Parse Errors | 4 | 20 | 0 | 26 | 67 | 0 |
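On the headline metric: "Micro Attack Success Rate" presumably pools every attack attempt across categories before averaging, as opposed to a macro average that weights each category equally (an assumption about this table's aggregation). A sketch of the difference:

```python
from collections import defaultdict

# Each attempt is (category, attack_succeeded); names here are illustrative.
Attempt = tuple[str, bool]

def micro_asr(attempts: list[Attempt]) -> float:
    """Pool all attempts, then average: large categories dominate."""
    return sum(ok for _, ok in attempts) / len(attempts)

def macro_asr(attempts: list[Attempt]) -> float:
    """Average per-category rates: every category counts equally."""
    by_cat: dict[str, list[bool]] = defaultdict(list)
    for cat, ok in attempts:
        by_cat[cat].append(ok)
    return sum(sum(v) / len(v) for v in by_cat.values()) / len(by_cat)
```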
📌 Summary
- General Knowledge & Reasoning: MAI-DS-R1 performs on par with DeepSeek-R1 and slightly better than R1-0528, particularly on `mgsm_chain_of_thought_zh`, where R1-0528 shows a notable drop.
- Blocked Topics: MAI-DS-R1 answers 99.3% of previously blocked prompts (versus 99.1% for R1-0528 and 30.7% for DS-R1) and scores highest in Answer Satisfaction.
- Safety: MAI-DS-R1 significantly outperforms both DS-R1 and R1-0528 across safety categories, especially in reducing harmful, illegal, or misleading outputs.