Error when passing attention_mask and using flash_attention_2
The model's forward method raises `RuntimeError: cu_seqlens_q must have shape (batch_size + 1)` when flash attention is used together with an attention mask.
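A minimal reproduction sketch, assuming the `answerdotai/ModernBERT-base` checkpoint (the exact checkpoint is not named in the report) and a CUDA device with `flash-attn` installed:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any ModernBERT checkpoint should reproduce the issue.
model_id = "answerdotai/ModernBERT-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # requires flash-attn and a CUDA GPU
    torch_dtype=torch.float16,
).to("cuda")

# A padded batch so that attention_mask contains zeros
inputs = tokenizer(
    ["short text", "a somewhat longer piece of text"],
    padding=True,
    return_tensors="pt",
).to("cuda")

# Raises: RuntimeError: cu_seqlens_q must have shape (batch_size + 1)
outputs = model(**inputs)
```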
The problem is in the forward method of the ModernBertModel class (lines 526 to 529), where it does:
```python
if attention_mask is not None:
    mask = self.mask_converter.to_4d(attention_mask, attention_mask.shape[1], inputs_embeds.dtype)
else:
    mask = None
```
This conversion should not be done when using flash attention, since transformers' `flash_attention_forward` function expects a 2D attention mask as input. Removing that part of the code makes it work.
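A hedged sketch of the kind of guard this suggests (not the actual patch; `config._attn_implementation` is the standard transformers attribute, while `mask_converter` and `inputs_embeds` come from the snippet above):

```python
# Only build the 4D mask for attention implementations that need it,
# and hand flash attention the original 2D mask untouched.
if attention_mask is not None and self.config._attn_implementation != "flash_attention_2":
    mask = self.mask_converter.to_4d(
        attention_mask, attention_mask.shape[1], inputs_embeds.dtype
    )
else:
    # flash_attention_2 expects the 2D (batch_size, seq_len) mask, or None
    mask = attention_mask
```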
Hey @giuseppe-trimigno, I just fixed the flash attention issue. Feel free to give it a try. You might need to clear your Hugging Face cache to redownload the model.
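An alternative to clearing the cache by hand is passing `force_download=True` to `from_pretrained` so the updated files are re-fetched from the Hub (a hedged sketch; the checkpoint name is illustrative):

```python
from transformers import AutoModel

# force_download=True re-fetches the model files from the Hub, bypassing the local cache.
model = AutoModel.from_pretrained(
    "answerdotai/ModernBERT-base",  # illustrative checkpoint, not named in the thread
    attn_implementation="flash_attention_2",
    force_download=True,
)
```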