Not working with latest vLLM / FlashInfer
#10 · by stev236 · opened
Great little model. These hybrid models (Jamba, Granite 4 H, and Qwen3 Next) are clearly the future.
Unfortunately, the latest version of vLLM generates nonsense with this model when using the FlashInfer backend (0.4 and up).
Switching the backend to FlashAttention (FLASH_ATTN) solves the problem, but unfortunately that backend doesn't support FP8 KV-cache quantization.
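For reference, here's roughly how I'm forcing the backend via the `VLLM_ATTENTION_BACKEND` environment variable (a minimal sketch; the model path is a placeholder, and the `kv_cache_dtype="fp8"` option is the one I can't use on this backend):

```python
import os

# Force the FlashAttention backend before vLLM is imported
# (this works around the FlashInfer garbage-output issue for me).
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/this-model",   # placeholder model path
    # kv_cache_dtype="fp8",       # what I'd like to use, but it only works with FlashInfer
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Hello, my name is"], params)
print(out[0].outputs[0].text)
```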
See https://github.com/vllm-project/vllm/issues/26936
Has anybody else noticed this?