System Prompt
We tested with the following system prompt and a temperature of 0.7:
You are a helpful and harmless assistant. You should think step-by-step.
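For reference, here is a minimal sketch of how the system prompt and temperature setting can be applied with Hugging Face transformers. The model ID is taken from the collection linked below; the user question and generation length are placeholders, not part of our evaluation setup.

```python
# Sketch: chat with the model using the evaluation system prompt at temperature 0.7.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
    {"role": "user", "content": "Solve: 1/2 + 1/3 = ?"},  # placeholder question
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling at temperature 0.7, as used in our tests; max_new_tokens is illustrative.
outputs = model.generate(input_ids, do_sample=True, temperature=0.7, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```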
Here are the evaluation results.
| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |
More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview
Any idea what explains the difference between your eval scores and those reported in the paper?
We found that the evaluation results for math and code were not correct in our current version. To address this, we now use the evaluation code from Qwen2.5-Math and Qwen2.5-Coder for the math and code benchmarks, respectively. With this approach, we have successfully reproduced the results reported in the DeepSeek-R1 paper. We will update all the results tomorrow. Please stay tuned.
We have finished all the evaluations and updated the results.
The reproduction details can be found in our blog: https://huggingface.co/blog/Wanfq/fuseo1-preview
We also provide the evaluation code in our GitHub repo: https://github.com/fanqiwan/FuseAI/tree/main/FuseO1-Preview
Our models are available at: https://huggingface.co/collections/FuseAI/fuseo1-preview-678eb56093649b2688bc9977
Have fun!