RTX
Hello, I know you wrote that this is untested on RTX, but do you have any hints on how to run it on RTX? Which inference engine?
I would use TensorRT-LLM as we have noted odd behavior on vLLM. Please report back if it works for you!
Would you mind sharing the command-line options you use to run it with TRT-LLM? I'm getting some errors, so I want to be sure I'm running it correctly.
So currently we build from source with some modifications to the modeling code to get this to work. It's pretty involved if you are not deeply familiar with CMake and with modifying inference libraries. Once you have it built from source, the main problem I can see is that SM100 (server Blackwell) and SM120 (RTX 6000 Pro/5090) have slightly different GEMMs, which may not even exist for SM120. If you somehow manage to get it all working, it's simply `trtllm-serve RESMP-DEV/GLM-4.6-NVFP4 --tp_size 4 --pp_size 2` on 8 GPUs, which has given us the fastest inference so far, but the TP and PP sizes may be tuned.
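For reference, a sketch of that invocation written out; the product tp_size x pp_size has to match the number of GPUs you serve on, so the smaller-GPU-count splits in the comments are untested assumptions, not configurations we have confirmed:

```bash
# Reported working on 8 GPUs: 4-way tensor parallel x 2-way pipeline parallel.
trtllm-serve RESMP-DEV/GLM-4.6-NVFP4 --tp_size 4 --pp_size 2

# On a 4-GPU RTX box the product still has to equal the GPU count, so you would
# try e.g. --tp_size 4 --pp_size 1 or --tp_size 2 --pp_size 2 instead; these
# splits are untested guesses and may also hit the missing SM120 GEMMs noted above.
```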
Is there anything you can share/release of what you have modified so I can compile it?
P.S.: I'm trying to implement NVFP4 in vLLM using Triton, building on recent PRs (not merged yet), but no success so far.