gradient_accumulation_steps/batchsize
The OpenR1-Qwen-7B model seems to reach its best checkpoint at 3150 steps. Could you share some details about the gradient_accumulation_steps and batch size used?
I have checked recipes/OpenR1-Qwen-7B/sft/config.yaml and found that gradient_accumulation_steps is 2 and the batch size is 1. Does that mean it doesn't need to train for many epochs?
The choice of gradient_accumulation_steps = 2 and batch_size = 1 (an effective per-device batch size of 2) is a deliberate setting for fine-tuning Qwen-7B. It leverages the pretrained capabilities of the model and balances computational/memory efficiency with the need for sufficient gradient signal from the fine-tuning data, making 3150 steps a reasonable target for the task.
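To make the relationship concrete, here is a minimal sketch of how the effective batch size is usually computed in distributed fine-tuning setups like this one. The `num_gpus` parameter is an assumption on my part (the recipe's actual world size isn't stated in this thread):

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Effective (global) batch size: samples contributing to one optimizer step.

    Gradients are accumulated over `gradient_accumulation_steps` forward/backward
    passes on each of `num_gpus` devices before a single optimizer update.
    """
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


# With the recipe's values (batch_size = 1, gradient_accumulation_steps = 2):
print(effective_batch_size(1, 2, num_gpus=1))  # single GPU -> 2
print(effective_batch_size(1, 2, num_gpus=8))  # hypothetical 8-GPU node -> 16
```

So each optimizer step sees only 2 samples per GPU, and the global batch scales with however many GPUs the run actually uses.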
I used the OpenR1-Math-220k dataset (the default) to train for 3 epochs, which takes a total of 102750 steps (gradient_accumulation_steps = 2, batch_size = 1). Can I actually stop at 3150 steps?
I have figured out why my experiment had 100k+ steps. My trl version is different, which caused max_length to have no effect: in my trl version, the argument should be max_seq_length. After fixing that error, my total is 3219 steps.
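The step-count discrepancy above follows directly from the standard arithmetic: total steps = epochs × ceil(num_training_sequences / effective batch size). When the sequence-length argument is applied, many short examples get packed (or truncated) into far fewer training sequences, which shrinks the step count dramatically. A minimal sketch, using hypothetical numbers (the actual packed-sequence count and GPU count for this run aren't given in the thread):

```python
import math


def total_steps(num_sequences: int, epochs: int,
                per_device_batch_size: int,
                gradient_accumulation_steps: int,
                world_size: int) -> int:
    """Total optimizer steps for a fixed-epoch SFT run.

    `num_sequences` is the number of training sequences the dataloader sees;
    with sequence packing enabled this can be far smaller than the raw
    example count of the dataset.
    """
    effective_batch = per_device_batch_size * gradient_accumulation_steps * world_size
    return epochs * math.ceil(num_sequences / effective_batch)


# Hypothetical illustration: 10,000 packed sequences, 3 epochs, the recipe's
# batch_size = 1 and gradient_accumulation_steps = 2, on an assumed 8 GPUs:
print(total_steps(10_000, 3, 1, 2, 8))  # -> 1875
```

This is why a misnamed sequence-length argument silently inflates the run: with no packing/truncation applied, `num_sequences` stays near the raw example count and the same formula yields tens of thousands of steps.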