Can anyone benchmark it against DeepSeek-R1-0528? I didn't find any precise benchmark data.

#13
by likewendy - opened

The two models feel similar in tone and style, so I want to see how they compare on benchmarks.
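
For anyone who actually wants to run a head-to-head comparison, here is a minimal sketch using EleutherAI's lm-evaluation-harness. The model IDs, task list, and backend are only illustrative; a model of this size will realistically need a vLLM or hosted-endpoint backend rather than plain `hf` loading:

```python
# Minimal sketch: head-to-head eval with EleutherAI's lm-evaluation-harness.
# Model IDs, tasks, and the "hf" backend are illustrative; a 600B+ MoE model
# realistically needs a vLLM or hosted-endpoint backend instead.
from lm_eval import evaluator

for model_id in ["deepseek-ai/DeepSeek-R1-0528", "microsoft/MAI-DS-R1"]:
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=["mmlu", "gsm8k", "arc_challenge"],
        num_fewshot=5,
        batch_size=8,
        limit=200,  # subsample for a quick smoke test
    )
    print(model_id, results["results"])
```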

I found that this model is not bad. It is a fine-tune of R1; the answer style and reasoning are different, but it is not an ideological fine-tune. I don't see anything wrong with its answers, and I couldn't even catch it crossing the line into slandering the Chinese Communist Party.

As for jailbreaking, it seems harder to jailbreak than R1. After a jailbreak, R1's answers stay coherent and logical, whereas this model slips back into refusals easily, and the quality of its answers is relatively poor.

In any case, the last thing it should be benchmarked against is R1-1776; a model built on ideology doesn't need benchmarking.


🧠 Evaluation on General Knowledge and Reasoning

| Categories | Benchmarks | Metrics | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|---|
| General Knowledge | anli_r30 | 7-shot Acc | 0.686 | 0.673 | 0.697 |
| | arc_challenge | 10-shot Acc | 0.963 | 0.963 | 0.963 |
| | hellaswag | 5-shot Acc | 0.864 | 0.860 | 0.859 |
| | mmlu (all) | 5-shot Acc | 0.867 | 0.863 | 0.870 |
| | mmlu/humanities | 5-shot Acc | 0.794 | 0.784 | 0.801 |
| | mmlu/other | 5-shot Acc | 0.883 | 0.879 | 0.886 |
| | mmlu/social_sciences | 5-shot Acc | 0.916 | 0.916 | 0.914 |
| | mmlu/STEM | 5-shot Acc | 0.867 | 0.864 | 0.870 |
| | openbookqa | 10-shot Acc | 0.936 | 0.938 | 0.954 |
| | piqa | 5-shot Acc | 0.933 | 0.926 | 0.939 |
| | winogrande | 5-shot Acc | 0.843 | 0.834 | 0.850 |
| Math | gsm8k_chain_of_thought | 0-shot Accuracy | 0.953 | 0.954 | 0.949 |
| | math | 4-shot Accuracy | 0.833 | 0.853 | 0.843 |
| | mgsm_chain_of_thought_en | 0-shot Accuracy | 0.972 | 0.968 | 0.976 |
| | mgsm_chain_of_thought_zh | 0-shot Accuracy | 0.880 | 0.796 | 0.900 |
| | AIME 2024 | Pass@1, n=2 | 0.7333 | 0.7333 | 0.7333 |
| Code | humaneval | 0-shot Accuracy | 0.866 | 0.841 | 0.860 |
| | livecodebench (8k tokens) | 0-shot Pass@1 | 0.531 | 0.484 | 0.632 |
| | LCB_coding_completion | 0-shot Pass@1 | 0.260 | 0.200 | 0.540 |
| | LCB_generation | 0-shot Pass@1 | 0.700 | 0.670 | 0.692 |
| | mbpp | 3-shot Pass@1 | 0.897 | 0.874 | 0.911 |
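
A note on the metric names: "Pass@1, n=2" (as in the AIME 2024 row) and the other Pass@1 figures are typically computed with the unbiased pass@k estimator from the HumanEval paper, averaged over problems. A minimal sketch of that estimator, with illustrative numbers rather than values from these runs:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = samples generated per problem, c = samples that passed, k = budget."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# "Pass@1, n=2": two samples per problem, pass@1 averaged over all problems.
print(pass_at_k(n=2, c=1, k=1))  # 0.5 for a problem where one of two samples passed
```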

🚫 Evaluation on Blocked Topics

| Benchmark | Metric | DS-R1 | R1-0528 | MAI-DS-R1 |
|---|---|---|---|---|
| Blocked topics test set | Answer Satisfaction | 1.68 | 2.76 | 3.62 |
| | % uncensored | 30.7 | 99.1 | 99.3 |

🔐 Evaluation on Safety

| Categories | DS-R1 (Answer) | R1-0528 (Answer) | MAI-DS-R1 (Answer) | DS-R1 (Thinking) | R1-0528 (Thinking) | MAI-DS-R1 (Thinking) |
|---|---|---|---|---|---|---|
| Micro Attack Success Rate | 0.441 | 0.481 | 0.209 | 0.394 | 0.325 | 0.134 |
| Functional Standard | 0.258 | 0.289 | 0.126 | 0.302 | 0.214 | 0.082 |
| Functional Contextual | 0.494 | 0.556 | 0.321 | 0.506 | 0.395 | 0.309 |
| Functional Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Misinfo/Disinfo | 0.500 | 0.648 | 0.315 | 0.519 | 0.500 | 0.259 |
| Semantic Chemical/Bio | 0.357 | 0.429 | 0.143 | 0.500 | 0.286 | 0.167 |
| Semantic Illegal | 0.189 | 0.170 | 0.019 | 0.321 | 0.245 | 0.019 |
| Semantic Harmful | 0.111 | 0.111 | 0.111 | 0.111 | 0.111 | 0.000 |
| Semantic Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062 |
| Semantic Cybercrime | 0.519 | 0.500 | 0.385 | 0.385 | 0.212 | 0.308 |
| Semantic Harassment | 0.000 | 0.048 | 0.000 | 0.048 | 0.048 | 0.000 |
| Num Parse Errors | 4 | 20 | 0 | 26 | 67 | 0 |
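
The category names (Functional Standard/Contextual/Copyright, Semantic Misinfo, Chemical/Bio, etc.) look like HarmBench's functional and semantic splits, so I read "Micro Attack Success Rate" as the pooled fraction of adversarial prompts that a judge marks as successful attacks, not an average of the per-category rates. A rough sketch, assuming per-prompt boolean judge verdicts:

```python
def attack_success_rates(judgments: dict[str, list[bool]]) -> tuple[float, dict[str, float]]:
    """judgments maps a category name to per-prompt judge verdicts
    (True = the model produced the harmful content the attack asked for).
    Returns the pooled (micro) rate and the per-category rates."""
    pooled = [v for verdicts in judgments.values() for v in verdicts]
    micro = sum(pooled) / len(pooled) if pooled else 0.0
    per_category = {cat: (sum(v) / len(v) if v else 0.0) for cat, v in judgments.items()}
    return micro, per_category
```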

📌 Summary

  • General Knowledge & Reasoning: MAI-DS-R1 performs on par with DeepSeek-R1 and slightly better than R1-0528, particularly excelling in mgsm_chain_of_thought_zh, where R1-0528 showed a notable drop.
  • Blocked Topics: MAI-DS-R1 answers 99.3% of the previously blocked prompts (on par with R1-0528's 99.1%) and scores highest on Answer Satisfaction.
  • Safety: MAI-DS-R1 significantly outperforms both DS-R1 and R1-0528 across safety categories, especially in reducing harmful, illegal, or misleading outputs.
