Not working with latest vLLM / FlashInfer
#10 · by stev236 · opened
Great little model. These hybrid models (Jamba, Granite 4 H, and Qwen3 Next) are clearly the future.
Unfortunately, the latest version of vLLM generates nonsense with this model when using the FlashInfer backend (0.4 and up).
Switching the backend to FlashAttention (FLASH_ATTN) solves the problem, but unfortunately that backend doesn't support FP8 KV-cache quantization.
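For reference, here's roughly how I'm forcing the backend via the `VLLM_ATTENTION_BACKEND` environment variable (a minimal sketch; the model path is a placeholder, and the `kv_cache_dtype="fp8"` option is the one I can't use on this backend):

```python
import os

# Force the FlashAttention backend before vLLM is imported
# (this works around the FlashInfer garbage-output issue for me).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/this-model",   # placeholder model path
    # kv_cache_dtype="fp8",       # what I'd like to use, but it only works with FlashInfer
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Hello, my name is"], params)
print(out[0].outputs[0].text)
```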
See https://github.com/vllm-project/vllm/issues/26936
Has anybody else noticed this?