The reduced self-attention dimensions are top-notch

#5
by owao

Fitting a 56K context length vs. only 16K for Qwen3 on a 24GB GPU, that's a game changer!
Thanks for that design choice, and thanks for the model!
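
For anyone wondering where the gain comes from: the KV cache grows linearly with the key/value dimensions, so shrinking them lets proportionally more tokens fit in the same VRAM. A rough back-of-the-envelope sketch (all dimensions below are hypothetical placeholders, not the actual configs of either model):

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Hypothetical dimensions, purely to illustrate the scaling:
print(kv_cache_gib(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=16_384))  # ~2.25 GiB
print(kv_cache_gib(num_layers=36, num_kv_heads=2, head_dim=128, seq_len=57_344))  # ~1.97 GiB
```

With 4x fewer KV heads, roughly 3.5x the context fits in slightly less memory, which matches the 16K-to-56K jump.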

By the way, what about flash attention? Does it make a difference?
Edit: I don't know how I missed that you were actually suggesting it in your serving example.
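
For reference, flash attention mostly speeds up the attention computation and cuts activation memory; the KV cache itself stays the same size. A minimal sketch of enabling it with Hugging Face transformers, in case anyone isn't using the serving example from the model card (the repo id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/model-name"  # placeholder; use the actual repo id from the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
```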
