Padding of labels bug?
I'm currently reimplementing the audio training sample code using PyTorch Lightning, and while debugging an issue I noticed this in the collator:
```python
labels = pad_sequence(labels_list, padding_side='left', padding_value=0)
```
When batching, should the labels not be padded with `_IGNORE_INDEX`?
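
For clarity, this is roughly what I would have expected instead (just a minimal sketch, assuming `_IGNORE_INDEX = -100` and that `pad_sequence` is `torch.nn.utils.rnn.pad_sequence`; `padding_side` needs a fairly recent PyTorch, but the snippet above already uses it, and `batch_first=True` is only there to make the printed example easier to read):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

_IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def collate_labels(labels_list):
    # pad with _IGNORE_INDEX instead of 0 so the padded positions
    # are skipped by the cross-entropy loss
    return pad_sequence(
        labels_list,
        batch_first=True,             # only for readability of the example
        padding_value=_IGNORE_INDEX,
        padding_side='left',          # same left padding as in the snippet above
    )

labels_list = [torch.tensor([5, 6, 7]), torch.tensor([8, 9])]
print(collate_labels(labels_list))
# tensor([[   5,    6,    7],
#         [-100,    8,    9]])
```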
I think the attention mask will handle it.
I think it does matter when calculating the loss, but the HF trainer is probably handling this case, i.e. converting the padding to -100 before the loss calculation.
Yes, I also tried to understand what exactly happens here. The labels tensor arrives at the loss calculation (ForCausalLMLoss) with the 0's still in place, e.g. in this part. Up to that point the 0's were untouched, and in ForCausalLMLoss I did not see anything that ignores the 0's or converts them to -100.
So in the loss calculation, the 0's are still there. I have also seen that my model starts learning to predict "!" at the beginning (presumably because token id 0 decodes to "!" in the tokenizer):
- PRED: ! ist der We so Trier bis zwei Jahren ratsam.
Is it really a bug? Or is there perhaps some hidden mechanism somewhere that I have missed?
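
For reference, a small self-contained check (not the actual sample code) of why I think the 0's matter: `F.cross_entropy`, which `ForCausalLMLoss` ends up calling, only skips positions whose label equals `ignore_index` (-100 by default), so labels padded with 0 are treated as real targets for token id 0.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 10
logits = torch.randn(1, 5, vocab_size)  # (batch, seq_len, vocab)

# the same 3 real labels, with the 2 trailing positions padded two different ways
labels_ignore = torch.tensor([[3, 7, 2, -100, -100]])  # padded with -100
labels_zero   = torch.tensor([[3, 7, 2,    0,    0]])  # padded with 0

def ce(labels):
    # F.cross_entropy skips only positions equal to ignore_index (-100 by default)
    return F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))

print(ce(labels_ignore))  # averaged over the 3 real tokens only
print(ce(labels_zero))    # padded positions count as "predict token id 0"
```

So with 0-padded labels, every padded position trains the model to predict token id 0, which would match the "!" I see at the start of the predictions.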