The reduced self-attention dimensions are top-notch

#5
by owao

Fitting a 56K context length vs. only 16K for Qwen3 on a 24GB GPU, that's a game changer!
Thanks for that design choice, and thanks for the model!
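
For anyone wondering where the gain comes from: the KV cache grows linearly with the key/value dimensions, so shrinking them lets proportionally more tokens fit in the same VRAM. A rough back-of-the-envelope sketch (all dimensions below are hypothetical placeholders, not the actual configs of either model):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Hypothetical dimensions, purely to illustrate the scaling:
print(kv_cache_gib(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=16_384))  # ~2.25 GiB
print(kv_cache_gib(num_layers=36, num_kv_heads=2, head_dim=128, seq_len=57_344))  # ~1.97 GiB
```

With 4x fewer KV heads, roughly 3.5x the context fits in slightly less memory, which matches the 16K-to-56K jump.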

By the way, what about flash attention? Does it make a difference?
Edit: I don't know how I missed that you were actually suggesting it in your serving example.
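
For reference, flash attention mostly speeds up the attention computation and cuts activation memory; the KV cache itself stays the same size. A minimal sketch of enabling it with Hugging Face transformers, in case anyone isn't using the serving example from the model card (the repo id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-name"  # placeholder; use the actual repo id from the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
```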
