RTX
Hello, I know you wrote that this is untested on RTX, but do you have any hints on how to run it on RTX? Which inference engine?
I would use TensorRT-LLM as we have noted odd behavior on vLLM. Please report back if it works for you!
Would you mind sharing the command-line options you use to run it with TRT-LLM? I'm getting some errors, so I want to be sure I'm running it correctly.
So currently we build from source with some modifications to the modeling code to get this to work. It's pretty involved if you are not deeply familiar with CMake and with modifying inference libraries. Once you have it built from source, the main problem I can see is that SM100 (server Blackwell) and SM120 (RTX 6000 Pro/5090) have slightly different GEMMs, which may not even exist for SM120. If you somehow manage to get it all working, it's simply `trtllm-serve RESMP-DEV/GLM-4.6-NVFP4 --tp_size 4 --pp_size 2` on 8 GPUs, which has given us the fastest inference so far, but the TP and PP sizes may be tuned.
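For reference, a sketch of that invocation written out; the product tp_size x pp_size has to match the number of GPUs you serve on, so the smaller-GPU-count splits in the comments are untested assumptions, not configurations we have confirmed:

```bash
# Reported working on 8 GPUs: 4-way tensor parallel x 2-way pipeline parallel.
trtllm-serve RESMP-DEV/GLM-4.6-NVFP4 --tp_size 4 --pp_size 2

# On a 4-GPU RTX box the product still has to equal the GPU count, so you would
# try e.g. --tp_size 4 --pp_size 1 or --tp_size 2 --pp_size 2 instead; these
# splits are untested guesses and may also hit the missing SM120 GEMMs noted above.
```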
Is there anything you can share/release of what you have modified so I can compile it?
P.S.: I'm trying to implement NVFP4 in vLLM using Triton, building on recent PRs (not merged yet), but no success so far.