Train Converted XLM-RoBERTa model without FlashAttention installed

#58
by GMedAI - opened

Hi

I successfully converted an XLM-RoBERTa model using the convert_roberta_weights_to_flash.py script. However, when I try to train it with the Hugging Face Trainer, I get the following error:

RuntimeError: FlashAttention is not installed. To proceed with training, please install FlashAttention. For inference, you have two options: either install FlashAttention or disable it by setting use_flash_attn=False when loading the model.
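
For reference, this is roughly the setup that triggers the error. The model path, the tiny dataset, and the masked-LM head are placeholders for my actual setup, not the exact code:

```python
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder: output directory of convert_roberta_weights_to_flash.py
model_path = "./xlm-roberta-flash-converted"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True)

# Tiny dummy dataset just to show the shape of the pipeline
train_dataset = Dataset.from_dict({"text": ["hello world", "bonjour le monde"]})
train_dataset = train_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./out", per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()  # this is where the FlashAttention RuntimeError is raised
```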

My GPU does not support FlashAttention, so I want to train the model without installing it. I have already tried setting use_flash_attn=False in the config and during model loading, but the error persists.
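
Concretely, these are the two ways I tried to disable it (paths are placeholders; use_flash_attn is the flag named in the error message):

```python
from transformers import AutoConfig, AutoModel

model_path = "./xlm-roberta-flash-converted"  # placeholder path

# Attempt 1: disable FlashAttention via the config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.use_flash_attn = False
model = AutoModel.from_pretrained(model_path, config=config, trust_remote_code=True)

# Attempt 2: pass the flag directly at load time
model = AutoModel.from_pretrained(model_path, use_flash_attn=False, trust_remote_code=True)

# Both variants load without complaint, but trainer.train() still raises the same RuntimeError.
```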

Interestingly, when I fine-tune the jinaai/jina-embeddings-v3 model (which, according to the documentation, uses this same converted implementation), it works perfectly and can be trained without FlashAttention installed.
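
For comparison, loading it like this and then fine-tuning with the same Trainer setup runs without FlashAttention on my machine:

```python
from transformers import AutoModel

# Works for training even though FlashAttention is not installed
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
```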

Question:
How can I train a converted XLM-RoBERTa model without FlashAttention installed, similar to how jinaai/jina-embeddings-v3 works? Is there a workaround or patch for this issue, or do I need to re-convert or modify something in the code to get pure PyTorch rotary embeddings for training?

Thank you for your help!
