System Prompt

#3 opened by Wanfq

We have tested the following system prompt with a sampling temperature of 0.7:

> You are a helpful and harmless assistant. You should think step-by-step.
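
For reference, here is a minimal sketch of how this system prompt can be applied at temperature 0.7 via the `transformers` chat template. Only the prompt text and temperature come from our setup; the user question and generation length are illustrative assumptions.

```python
# Minimal sketch: sampling with the system prompt at temperature 0.7.
# The user question and max_new_tokens are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system",
     "content": "You are a helpful and harmless assistant. You should think step-by-step."},
    {"role": "user", "content": "What is 12 * 34?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, do_sample=True, temperature=0.7, max_new_tokens=2048)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```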

Here are the evaluation results:

| Model | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |

More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview

Any idea what explains the difference between your eval scores and those from the paper?

(screenshot attached: Screenshot 2025-01-22 at 11.02.10.png)

We found that the math and code evaluation results in our current version are not correct. To address this, we switched to the evaluation code from Qwen2.5-Math and Qwen2.5-Coder for the math and code benchmarks. With this approach, we have successfully reproduced the results reported in the DeepSeek-R1 paper. We will update all the results tomorrow. Please stay tuned.
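
For context, one common source of such discrepancies on math benchmarks is final-answer extraction: graders pull the last `\boxed{...}` expression out of a long chain of thought, and a brittle extractor can fail on nested braces or pick up an intermediate step. Below is an illustrative sketch of this kind of extractor; it is not the actual Qwen2.5-Math evaluation code, just the general idea.

```python
# Illustrative sketch of \boxed{...} answer extraction for math grading.
# NOT the actual Qwen2.5-Math evaluation code; just the general idea.
def extract_boxed_answer(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a model response."""
    idx = text.rfind("\\boxed{")
    if idx == -1:
        return None
    i = idx + len("\\boxed{")
    start, depth = i, 1  # track brace nesting, e.g. \boxed{\frac{1}{2}}
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return text[start:i - 1] if depth == 0 else None

# Grading against the *last* boxed expression avoids picking up
# intermediate results earlier in the chain of thought.
resp = "First, \\boxed{\\frac{1}{2}} of the total... Final answer: \\boxed{42}"
assert extract_boxed_answer(resp) == "42"
```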

We have finished all the evaluations and updated the results here:

(image: FuseO1-Preview evaluation results)

The reproduction details can be found in our blog post: https://huggingface.co/blog/Wanfq/fuseo1-preview

We also provide the evaluation code in our GitHub repo: https://github.com/fanqiwan/FuseAI/tree/main/FuseO1-Preview

Our models are available in our Hugging Face collection: https://huggingface.co/collections/FuseAI/fuseo1-preview-678eb56093649b2688bc9977

Have fun!
