GLM-4.6-FP8 - 55 tokens/sec on 4x RTX 6000 PRO
Hello,
I'm getting 55 tokens/sec with sglang using the Triton FP8 kernel path for the FP8 version:
docker run -it --rm -v /mnt:/mnt/ --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --network host lmsysorg/sglang:b200-cu129 bash
NCCL_P2P_LEVEL=4 NCCL_DEBUG=INFO PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True USE_TRITON_W8A8_FP8_KERNEL=1 SGL_ENABLE_JIT_DEEPGEMM=0 python -m sglang.launch_server --model /mnt/GLM-4.6-FP8/ --tp 4 --host 0.0.0.0 --port 4999 --mem-fraction-static 0.96 --context-length 200000 --enable-metrics --attention-backend flashinfer --tool-call-parser glm45 --reasoning-parser glm45 --served-model-name glm-4.5-air --chunked-prefill-size 8092 --enable-mixed-chunk --cuda-graph-max-bs 16 --kv-cache-dtype fp8_e5m2 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
On first launch, look in the log for the missing Triton kernel config .json and copy voipmonitor.org/sm120.json to that path (typically something like E=128,N=704,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,dtype=fp8_w8a8.json).
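Roughly like this, as a sketch (CONFIG_DIR is a placeholder for whatever directory the sglang warning points to):
CONFIG_DIR=/path/reported/in/the/sglang/log   # placeholder; use the directory from the missing-config warning
wget https://voipmonitor.org/sm120.json
cp sm120.json "$CONFIG_DIR/E=128,N=704,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,dtype=fp8_w8a8.json"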
question - I suppose NVFP4 is still not implemented anywhere?
I've not seen an NVFP4 quant yet. 55 t/s is pretty great for FP8; that's about what I get with the AWQ. My FP8 on vLLM is about 30 t/s. Have you tried using EAGLE for MTP with sglang? I'm currently distilling 4.6 for coding to see if I can train some Medusa heads for MTP. Unfortunately it's going to take a while.
vLLM's FP8 path uses CUTLASS, which is not as fast as the Triton FP8 implementation on sm120. I have enabled the Triton path for FP8 in sglang, but you have to set USE_TRITON_W8A8_FP8_KERNEL=1.
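For reference, a trimmed launch line with just those switches set (all flags taken from the full command above):
USE_TRITON_W8A8_FP8_KERNEL=1 SGL_ENABLE_JIT_DEEPGEMM=0 python -m sglang.launch_server --model /mnt/GLM-4.6-FP8/ --tp 4 --attention-backend flashinfer --host 0.0.0.0 --port 4999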
Here is an NVFP4 quant: https://huggingface.co/RESMP-DEV/GLM-4.6-NVFP4
The problem is that I can't find any inference engine supporting NVFP4 block scaling on sm120.
I believe we should get roughly double the speed once NVFP4 runs natively: decode is memory-bandwidth bound, so halving the bytes read per weight roughly doubles throughput. If we get 55 t/s with FP8, we should get around 110 t/s with NVFP4.
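Back-of-envelope, assuming decode throughput scales inversely with bytes per weight:
python -c "fp8_tps = 55; print(fp8_tps * 8 / 4)"   # 8-bit -> 4-bit weights: ~110 t/s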
EAGLE MTP for GLM-4.6 is memory bound and is slower than not using it. But it's different with GLM-4.5-Air-FP8: EAGLE works there, and I'm getting 180 tokens/sec on 4 cards.
USE_TRITON_W8A8_FP8_KERNEL=1 SGL_ENABLE_JIT_DEEPGEMM=false PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m sglang.launch_server --model /mnt/GLM-4.5-Air-FP8/ --tp 4 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --host 0.0.0.0 --port 5000 --mem-fraction-static 0.80 --context-length 128000 --enable-metrics --attention-backend flashinfer --tool-call-parser glm45 --reasoning-parser glm45 --served-model-name glm-4.5-air --chunked-prefill-size 64736 --enable-mixed-chunk --cuda-graph-max-bs 1024 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
festr2, does this AWQ quant run on your RTX 6000s? I'm getting some errors. The other AWQ by QuantTrio loads, but it isn't outputting reasoning or stopping generations properly in sglang. Maybe I'll have better luck with these AWQs in vLLM?
Update: works great in vLLM! Handles parallel requests and parallel tool calls great, thx!!
you mean FP8?
FP8 I ran in sglang and it works great! I meant this AWQ doesn't work for me in sglang, but it does load up and work great in vLLM.