gradient_accumulation_steps/batchsize
The OpenR1-Qwen-7B model seems to reach its best checkpoint at 3150 steps. Could you share some details about the gradient_accumulation_steps and batch size used?
I have checked recipes/OpenR1-Qwen-7B/sft/config.yaml and found that gradient_accumulation_steps is 2 and the batch size is 1. Does that mean it doesn't need to train for many epochs?
The choice of gradient_accumulation_steps = 2 and batch_size = 1 (an effective per-device batch size of 2) is a deliberate setting for fine-tuning Qwen-7B. It leverages the pretrained capabilities of the model and balances computational/memory efficiency with the need for sufficient gradient signal from the fine-tuning data, making 3150 steps a reasonable target for the task.
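To make the relationship concrete, here is a minimal sketch of how the effective batch size is usually computed in distributed fine-tuning setups like this one. The `num_gpus` parameter is an assumption on my part (the recipe's actual world size isn't stated in this thread):

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Effective (global) batch size: samples contributing to one optimizer step.

    Gradients are accumulated over `gradient_accumulation_steps` forward/backward
    passes on each of `num_gpus` devices before a single optimizer update.
    """
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


# With the recipe's values (batch_size = 1, gradient_accumulation_steps = 2):
print(effective_batch_size(1, 2, num_gpus=1))  # single GPU -> 2
print(effective_batch_size(1, 2, num_gpus=8))  # hypothetical 8-GPU node -> 16
```

So each optimizer step sees only 2 samples per GPU, and the global batch scales with however many GPUs the run actually uses.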
I used the OpenR1-Math-220k dataset (the default) to train for 3 epochs, which takes a total of 102750 steps (gradient_accumulation_steps = 2, batch_size = 1). Can I actually stop at 3150 steps?
I have figured out why my experiment had 100k+ steps. My trl version is different, which caused max_length to have no effect: in my trl version, the argument should be max_seq_length. After fixing that error, my total is 3219 steps.
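The step-count discrepancy above follows directly from the standard arithmetic: total steps = epochs × ceil(num_training_sequences / effective batch size). When the sequence-length argument is applied, many short examples get packed (or truncated) into far fewer training sequences, which shrinks the step count dramatically. A minimal sketch, using hypothetical numbers (the actual packed-sequence count and GPU count for this run aren't given in the thread):

```python
import math


def total_steps(num_sequences: int, epochs: int,
                per_device_batch_size: int,
                gradient_accumulation_steps: int,
                world_size: int) -> int:
    """Total optimizer steps for a fixed-epoch SFT run.

    `num_sequences` is the number of training sequences the dataloader sees;
    with sequence packing enabled this can be far smaller than the raw
    example count of the dataset.
    """
    effective_batch = per_device_batch_size * gradient_accumulation_steps * world_size
    return epochs * math.ceil(num_sequences / effective_batch)


# Hypothetical illustration: 10,000 packed sequences, 3 epochs, the recipe's
# batch_size = 1 and gradient_accumulation_steps = 2, on an assumed 8 GPUs:
print(total_steps(10_000, 3, 1, 2, 8))  # -> 1875
```

This is why a misnamed sequence-length argument silently inflates the run: with no packing/truncation applied, `num_sequences` stays near the raw example count and the same formula yields tens of thousands of steps.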