vLLM, SGLang
Has anyone been able to deploy this AWQ model with either SGLang or vLLM? If so, please provide the version / PR, etc. Thank you.
vLLM works fine with good quality. Recommended.
Which version of vLLM works for you?
How fast does this run with 48 GB of VRAM on vLLM/SGLang?
I got this to work with vLLM 0.8.5 on 8x A10 cards. It required providing a fused_moe config file at "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_A10,dtype=int4_w4a16.json", since configs only ship for H-series cards and the default settings cannot work without one. This config file can be mostly generated using the benchmark_moe.py script from the vLLM project.
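For reference, a fused_moe config is a JSON file mapping a token batch size to Triton tile parameters for the MoE kernel. A minimal sketch of the shape such a file takes is below; the block-size values here are illustrative placeholders, not tuned numbers for the A10 (real values come from the benchmark_moe.py tuning run).

```json
{
  "1":   {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
  "16":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128, "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
  "256": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,  "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4}
}
```

At runtime vLLM picks the entry whose batch-size key is closest to the actual number of tokens, which is why a per-device file is needed.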
When you say "provide the config", how do you generate the config for the int4_w4a16 dtype? There is no option for it in the benchmark_moe.py script.
Initially I generated the config without specifying a dtype, since I used it for the unquantized model released by Qwen. As you pointed out, there is no option for this dtype in the benchmark_moe.py script, so I re-used that config file, generated without specifying a dtype, by appending "dtype=int4_w4a16" to the JSON filename.
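The rename step above can be scripted. This is a small sketch, not part of vLLM itself: `add_dtype_suffix` is a hypothetical helper, and the demo uses a stand-in temp file rather than the real fused_moe/configs directory inside your vLLM install.

```python
import json
import shutil
import tempfile
from pathlib import Path

def add_dtype_suffix(config_path: Path, dtype: str) -> Path:
    """Copy a tuned fused_moe config (generated without a dtype) to the
    dtype-suffixed filename that the quantized model looks up."""
    target = config_path.with_name(f"{config_path.stem},dtype={dtype}.json")
    shutil.copy(config_path, target)
    return target

# Demo with a stand-in file; in practice config_path would live under
# .../vllm/model_executor/layers/fused_moe/configs/ in your install.
tmp = Path(tempfile.mkdtemp())
src = tmp / "E=128,N=192,device_name=NVIDIA_A10.json"
src.write_text(json.dumps({"1": {"BLOCK_SIZE_M": 16}}))
out = add_dtype_suffix(src, "int4_w4a16")
print(out.name)  # E=128,N=192,device_name=NVIDIA_A10,dtype=int4_w4a16.json
```

Copying (rather than moving) keeps the dtype-less config available for the unquantized model as well.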