System Prompt
We tested with the following system prompt and a temperature of 0.7:
You are a helpful and harmless assistant. You should think step-by-step.
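For reference, here is a minimal sketch of how the system prompt and temperature setting can be applied with Hugging Face transformers. The model ID is taken from the collection linked below; the user question and generation length are placeholders, not part of our evaluation setup.

```python
# Sketch: chat with the model using the evaluation system prompt at temperature 0.7.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
    {"role": "user", "content": "Solve: 1/2 + 1/3 = ?"},  # placeholder question
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling at temperature 0.7, as used in our tests; max_new_tokens is illustrative.
outputs = model.generate(input_ids, do_sample=True, temperature=0.7, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```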
Here are the evaluation results.
| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
Any idea what explains the difference between your eval scores and those reported in the paper?
We found that the evaluation results for math and code were not correct in our current version. To address this, we now use the evaluation code from Qwen2.5-Math and Qwen2.5-Coder for the math and code benchmarks, respectively. With this approach, we have successfully reproduced the results reported in the DeepSeek-R1 paper. We will update all the results tomorrow. Please stay tuned.
We have finished all the evaluations and updated the results.
The reproduction details can be found in our blog: https://huggingface.co/blog/Wanfq/fuseo1-preview
We also provide the evaluation code in our GitHub repo: https://github.com/fanqiwan/FuseAI/tree/main/FuseO1-Preview
Our models are available at: https://huggingface.co/collections/FuseAI/fuseo1-preview-678eb56093649b2688bc9977
Have fun!