
Thanks for your wonderful work. Question about accuracy in paper

#2
by baibizhe - opened

Thanks for your wonderful work. I have a question about the accuracy reported in your paper for Qwen2.5-3B-Instruct.
According to the original Qwen2.5 technical report, Qwen2.5-3B-Instruct achieves 79.09 on the GSM8K dataset. However, you report 0.576 for Qwen2.5-3B-Instruct in Table 2. This is a huge gap. Am I misunderstanding something here? Thanks
https://arxiv.org/pdf/2412.15115


Thanks for your question! The difference in results is due to different benchmark settings and prompting approaches.

We used lm-evaluation-harness for consistent evaluation across all models, with its default prompts and settings. These likely differ from the evaluation setup used in the original Qwen2.5 report, which explains the performance gap you observed.

You can reproduce our results using lm-evaluation-harness v0.4.8 with the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --main_process_port 29531 -m lm_eval \
  --model hf \
  --model_args pretrained=Qwen/Qwen2.5-3B-Instruct,trust_remote_code=True \
  --tasks gsm8k \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size 4 > 4-gsm8k.out 2>&1
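
If you prefer to drive the harness from Python instead of the CLI, the sketch below does the same thing through lm-evaluation-harness's simple_evaluate API. This is a minimal sketch assuming lm_eval v0.4.8 as above; the exact metric key names inside the results dict (e.g. "exact_match,strict-match") can vary between harness versions.

import json
import lm_eval

# Mirror the CLI flags above: 5-shot GSM8K, with the model's chat template
# applied and the few-shot examples rendered as a multi-turn conversation.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-3B-Instruct,trust_remote_code=True",
    tasks=["gsm8k"],
    num_fewshot=5,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
    batch_size=4,
)

# Aggregated GSM8K metrics; key names depend on the harness version.
print(json.dumps(results["results"]["gsm8k"], indent=2))

Note that --apply_chat_template and --fewshot_as_multiturn in particular change how prompts are rendered for instruct models, and toggling them is a common source of large score differences between evaluation setups.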
