
Flash Attention 2 error, could you kindly provide a solution

#13
by k1-m - opened
from parler_tts import ParlerTTSForConditionalGeneration

model_name = "ai4bharat/indic-parler-tts"
model = ParlerTTSForConditionalGeneration.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2"  # <-- enable Flash Attention 2
)

This gave the following error:

\default\Lib\site-packages\transformers\modeling_utils.py", line 1617, in _autoset_attn_implementation
cls._check_and_enable_flash_attn_2(
File "<..>\Lib\site-packages\transformers\modeling_utils.py", line 1736, in _check_and_enable_flash_attn_2
raise ValueError(
ValueError: T5EncoderModel does not support Flash Attention 2.0 yet.

Setup: Windows 11 with torch 2.6.0; transformers and the other dependencies are already installed, CUDA 12.6 is present, and the GPU works with transformers (verified with a different script).

Unfortunately, without Flash Attention 2 (which is expected to bring a significant speedup), even 3 lines of Telugu text take quite long on an RTX 3080 GPU.

I need Flash Attention 2 to get the maximum speedup so I can process a text file containing hundreds of lines, so resolving this Flash Attention 2 error would be very helpful.
Thank you in advance.

AI4Bharat org

T5 models do not support Flash Attention, so you need to specify the attention implementation per sub-module, enabling flash-attention only for the decoder.
E.g.:

attn_implementation={"decoder": model_args.attn_implementation, "text_encoder": "eager"}
AshwinSankar changed discussion status to closed

Then, could the misleading entry below be removed from the indic-parler-tts page (snippet pasted below), so that it helps others when choosing this model and planning to use the performance tips/guidelines:
"Tips:
We've set up an inference guide ( https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md ) to make generation faster. Think SDPA, torch.compile, batching and streaming!"

Notes on above tips:

  1. SDPA as suggested does not work (per the reply above, it is unsupported)
  2. The flash-attention attention implementation is also not supported

As speed/time matters a lot, accurate documentation about performance would help many people, both when selecting a model and when estimating the time impact.
