
Thanks for your wonderful work. Question about accuracy in paper

#2
by baibizhe - opened

Thanks for your wonderful work. I have a question about the accuracy reported in your paper for Qwen2.5-3B-Instruct.
According to the original Qwen2.5 technical report, Qwen2.5-3B-Instruct achieves 79.09 on the GSM8K dataset. However, you report 0.576 for Qwen2.5-3B-Instruct in Table 2. This is a huge gap. Am I misunderstanding something here? Thanks
https://arxiv.org/pdf/2412.15115


Thanks for your question! The difference in results is due to different benchmark settings and prompting approaches.

We used lm-evaluation-harness for consistent evaluation across all models, with its default prompts and settings. These likely differ from the evaluation setup used in the original Qwen2.5 report, which explains the performance gap you observed.

You can reproduce our results using lm-evaluation-harness v0.4.8 with the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --main_process_port 29531 -m lm_eval \
  --model hf \
  --model_args pretrained=Qwen/Qwen2.5-3B-Instruct,trust_remote_code=True \
  --tasks gsm8k \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size 4 > 4-gsm8k.out 2>&1
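
If you prefer to drive the harness from Python instead of the CLI, the sketch below does the same thing through lm-evaluation-harness's simple_evaluate API. This is a minimal sketch assuming lm_eval v0.4.8 as above; the exact metric key names inside the results dict (e.g. "exact_match,strict-match") can vary between harness versions.

import json
import lm_eval

# Mirror the CLI flags above: 5-shot GSM8K, with the model's chat template
# applied and the few-shot examples rendered as a multi-turn conversation.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-3B-Instruct,trust_remote_code=True",
    tasks=["gsm8k"],
    num_fewshot=5,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
    batch_size=4,
)

# Aggregated GSM8K metrics; key names depend on the harness version.
print(json.dumps(results["results"]["gsm8k"], indent=2))

Note that --apply_chat_template and --fewshot_as_multiturn in particular change how prompts are rendered for instruct models, and toggling them is a common source of large score differences between evaluation setups.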
