Having trouble with the pre-training process for the v1 and v2 versions of Geneformer.
I've noticed a significant performance difference when using DeepSpeed with what I believe to be the same configuration. Specifically, pre-training on 10 million data points takes only 4 hours in my v1 environment, but the same task requires 12 hours in the v2 environment (4 A800 GPUs).
Is it expected that the v2 training is this much more time-consuming, or could this point to a problem in my training process? I'm wondering if factors like different library versions or training parameters could be causing this discrepancy.
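In case it helps with diagnosing this, below is the minimal version check I am running in both environments to rule out library differences; the package list is just my assumption of what matters for the pretraining setup, so it may need adjusting to what is actually installed.

```python
# Minimal environment comparison: run the same snippet in the v1 and v2 setups
# and diff the output. The package list is an assumption; extend as needed.
from importlib.metadata import PackageNotFoundError, version

import torch

for pkg in ("torch", "transformers", "deepspeed", "datasets", "accelerate"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

# Runtime details that also affect training throughput.
print("CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```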
Thanks for your question. The V2 model is significantly larger, so this is expected.
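If you want to quantify "larger", a quick sanity check is to count parameters for the two configurations. The values below are illustrative placeholders rather than the exact published V1/V2 configs, so substitute the settings you are actually training with.

```python
# Compare model sizes by counting parameters for two illustrative BertConfig
# variants. Vocab size and intermediate size are placeholders; match them to
# the actual pretraining script before drawing conclusions.
from transformers import BertConfig, BertForMaskedLM

def count_params(num_layers: int, hidden: int, heads: int) -> int:
    cfg = BertConfig(
        vocab_size=25426,                # placeholder token-dictionary size
        hidden_size=hidden,
        num_hidden_layers=num_layers,
        num_attention_heads=heads,
        intermediate_size=hidden * 2,    # assumed ratio
        max_position_embeddings=2048,
    )
    return sum(p.numel() for p in BertForMaskedLM(cfg).parameters())

print("smaller config:", count_params(num_layers=6, hidden=256, heads=4))
print("larger config :", count_params(num_layers=12, hidden=512, heads=8))
```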
So, to put it another way: suppose I make the following parameters identical (max_input_size = 2**11, num_layers = 6, num_attn_heads = 4, embed_dim = 512, based on your pretrain_geneformer_w_deepspeed.py script), use the same dataset, and use the same type and number of GPUs, so that after controlling for all of these variables the only change is the training environment (v1 vs. v2, understanding that the tokenizers are necessarily different). Can I then attribute the resulting roughly 3x difference in training time (4 hours vs. 12 hours) entirely to v2 being computationally more complex?
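To make the comparison concrete, this is the kind of controlled micro-benchmark I plan to run unchanged in both environments. It is only a sketch assuming the standard Hugging Face BertForMaskedLM setup; the vocab size, intermediate size, and batch size are placeholders rather than values taken from your script.

```python
# Time a fixed number of forward/backward steps with the parameters quoted
# above on synthetic token batches. Running the identical script in the v1 and
# v2 environments isolates the environment's contribution to the slowdown.
import time

import torch
from transformers import BertConfig, BertForMaskedLM

max_input_size = 2**11  # 2048
config = BertConfig(
    vocab_size=25426,                  # placeholder
    hidden_size=512,                   # embed_dim
    num_hidden_layers=6,               # num_layers
    num_attention_heads=4,             # num_attn_heads
    intermediate_size=1024,            # assumed; match the pretraining script
    max_position_embeddings=max_input_size,
)
model = BertForMaskedLM(config).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = torch.randint(0, config.vocab_size, (4, max_input_size), device="cuda")
labels = batch.clone()

def step():
    loss = model(input_ids=batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

for _ in range(3):   # warm-up
    step()

torch.cuda.synchronize()
start = time.time()
n_steps = 20
for _ in range(n_steps):
    step()
torch.cuda.synchronize()
print(f"{n_steps / (time.time() - start):.2f} steps/sec")
```

If the steps/sec differ by roughly the same factor as the full runs, the environment rather than the data pipeline would seem to be responsible.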
If you are pretraining a new model, the pretraining script is unchanged and is not related to the V1 vs. V2 model change.
That clears everything up; thank you for the detailed explanation. I understand now that the increased complexity and training time in v2 are expected by design.