GLM-4.6-FP8 - 55 tokens/sec on 4x RTX 6000 PRO
Hello,
I'm getting 55 tokens/sec with sglang using the Triton FP8 kernel path for the FP8 version:
docker run -it --rm -v /mnt:/mnt/ --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --network host lmsysorg/sglang:b200-cu129 bash
NCCL_P2P_LEVEL=4 NCCL_DEBUG=INFO PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True USE_TRITON_W8A8_FP8_KERNEL=1 SGL_ENABLE_JIT_DEEPGEMM=0 python -m sglang.launch_server --model /mnt/GLM-4.6-FP8/ --tp 4 --host 0.0.0.0 --port 4999 --mem-fraction-static 0.96 --context-length 200000 --enable-metrics --attention-backend flashinfer --tool-call-parser glm45 --reasoning-parser glm45 --served-model-name glm-4.5-air --chunked-prefill-size 8092 --enable-mixed-chunk --cuda-graph-max-bs 16 --kv-cache-dtype fp8_e5m2 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
On first launch, look in the log for the missing Triton kernel config .json and copy voipmonitor.org/sm120.json to that path (typically something like E=128,N=704,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,dtype=fp8_w8a8.json).
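Roughly like this, as a sketch (CONFIG_DIR is a placeholder for whatever directory the sglang warning points to):
CONFIG_DIR=/path/reported/in/the/sglang/log   # placeholder; use the directory from the missing-config warning
wget https://voipmonitor.org/sm120.json
cp sm120.json "$CONFIG_DIR/E=128,N=704,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,dtype=fp8_w8a8.json"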
question - I suppose NVFP4 is still not implemented anywhere?
I've not seen an NVFP4 quant yet. 55 t/s is pretty great for FP8; that's about what I get with the AWQ. My FP8 on vLLM is about 30 t/s. Have you tried using EAGLE for MTP with sglang? I'm currently distilling 4.6 for coding to see if I can train some Medusa heads for MTP. Unfortunately it's going to take a while.
vLLM's FP8 path uses CUTLASS, which is not as fast as the Triton FP8 implementation on sm120. I have enabled the Triton path for FP8 in sglang, but you have to set USE_TRITON_W8A8_FP8_KERNEL=1.
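For reference, a trimmed launch line with just those switches set (all flags taken from the full command above):
USE_TRITON_W8A8_FP8_KERNEL=1 SGL_ENABLE_JIT_DEEPGEMM=0 python -m sglang.launch_server --model /mnt/GLM-4.6-FP8/ --tp 4 --attention-backend flashinfer --host 0.0.0.0 --port 4999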
Here is an NVFP4 quant: https://huggingface.co/RESMP-DEV/GLM-4.6-NVFP4
The problem is that I can't find any inference engine supporting NVFP4 block scaling on sm120.
I believe we should get roughly double the speed once NVFP4 runs natively: decode is memory-bandwidth bound, so halving the bytes read per weight roughly doubles throughput. If we get 55 t/s with FP8, we should get around 110 t/s with NVFP4.
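Back-of-envelope, assuming decode throughput scales inversely with bytes per weight:
python -c "fp8_tps = 55; print(fp8_tps * 8 / 4)"   # 8-bit -> 4-bit weights: ~110 t/s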
EAGLE MTP for GLM-4.6 is memory bound and is slower than not using it. But it's different with GLM-4.5-Air-FP8: EAGLE works there, and I'm getting 180 tokens/sec on 4 cards.
USE_TRITON_W8A8_FP8_KERNEL=1 SGL_ENABLE_JIT_DEEPGEMM=false PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m sglang.launch_server --model /mnt/GLM-4.5-Air-FP8/ --tp 4 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --host 0.0.0.0 --port 5000 --mem-fraction-static 0.80 --context-length 128000 --enable-metrics --attention-backend flashinfer --tool-call-parser glm45 --reasoning-parser glm45 --served-model-name glm-4.5-air --chunked-prefill-size 64736 --enable-mixed-chunk --cuda-graph-max-bs 1024 --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
festr2, does this AWQ quant run on your RTX 6000s? I'm getting some errors. The other AWQ by QuantTrio loads, but it isn't outputting reasoning or stopping generations properly in sglang. Maybe I'll have better luck with these AWQs in vLLM?
Update: works great in vLLM! Handles parallel requests and parallel tool calls great, thx!!
you mean FP8?
FP8 I ran in sglang and it works great! I meant this AWQ doesn't work for me in sglang, but it does load up and work great in vLLM.