How to use fp8 mat-mul?

#7
by Muawiz - opened

FP8-scaled weights are usually converted back to the base precision for inference. How can we utilize fp8 matmul directly, so we benefit from GPUs that support it, like the H100?
Currently I am getting:
ValueError: The model is a scaled fp8 model, please set quantization to '_scaled'
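For reference, "fp8 matmul" here means running the Linear projections through the GPU's fp8 GEMM path instead of upcasting the weights first. A rough sketch of that, assuming PyTorch 2.4+ (where the private torch._scaled_mm op takes the scales and returns a single tensor), per-tensor scales, and feature dimensions that are multiples of 16; this is illustrative only, not what the wrapper actually does:

```python
import torch

def fp8_linear(x, w_fp8, w_scale):
    # x: 2D fp16/bf16 activations (tokens, in_features), row-major.
    # w_fp8: (out_features, in_features) stored as float8_e4m3fn.
    # w_scale: float32 scalar tensor (per-tensor weight scale).
    # Quantize the activation on the fly, then run the matmul itself in fp8
    # on the H100's fp8 tensor cores via the (private) torch._scaled_mm op.
    x_scale = (x.abs().amax() / torch.finfo(torch.float8_e4m3fn).max).float()
    x_fp8 = (x / x_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
    # _scaled_mm wants the second operand column-major, which .t() of the
    # row-major weight provides; feature dims must be multiples of 16.
    return torch._scaled_mm(
        x_fp8, w_fp8.t(),
        scale_a=x_scale, scale_b=w_scale,
        out_dtype=x.dtype,
    )
```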

WanVideo Model Loader -> quantization.
If it's not there, update the wrapper nodes.

[Screenshot: WanVideo2_2_I2V_00100 workflow in ComfyUI]

Do any of these change the precision used for the inference itself?
I guess using FP8 for the inference would save a bit of VRAM, but degrade quality a lot...?

These fp8 quantizations are just size-based, reducing the safetensors file size.
I wanted to try fp8 inference, but all I got was a black screen when I quantized the weights to fp8 using optimum-quanto and ran inference in fp8.
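For context, that kind of size-based fp8 is weight-only: the tensor is stored in float8 with a scale and gets upcast to the activation dtype right before the matmul, so the compute itself still runs in fp16/bf16. A minimal sketch with illustrative names, assuming a per-tensor scale:

```python
import torch
import torch.nn.functional as F

def weight_only_fp8_linear(x, w_fp8, w_scale, bias=None):
    # "Size-based" fp8: the weight lives in float8_e4m3fn in VRAM, but is
    # dequantized to the activation dtype before F.linear, so the matmul
    # itself (and its quality) is still fp16/bf16.
    w = w_fp8.to(x.dtype) * w_scale.to(x.dtype)
    return F.linear(x, w, bias)

# Producing the fp8 weight and scale (illustrative; 448 is the e4m3 max):
w = torch.randn(4096, 4096, dtype=torch.float16)
w_scale = w.abs().amax().float() / torch.finfo(torch.float8_e4m3fn).max
w_fp8 = (w / w_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
```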

I have not implemented fp8 matmul with the scaled models yet. In 2.1, fp8 matmul had a huge quality impact, so it wasn't worth using; it's better in 2.2, so I need to look at this when I get the time.

Also, I should note that fp16_fast, aka full fp16 accumulation, already makes the Linear ops ~20% faster, so fp8 matmul, while faster, isn't that huge an improvement over it as long as it's feasible to use fp16 as the base precision.
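For anyone who wants to try the same behavior outside the wrapper: recent PyTorch builds expose full fp16 accumulation as a global matmul flag. A hedged sketch (the flag name is my assumption of the current spelling and it is absent from older versions, hence the guard):

```python
import torch

# "fp16_fast" style speedup: let fp16 matmuls accumulate in fp16 instead of
# fp32. The flag below only exists in recent PyTorch builds; on older
# versions this sketch simply does nothing.
if hasattr(torch.backends.cuda.matmul, "allow_fp16_accumulation"):
    torch.backends.cuda.matmul.allow_fp16_accumulation = True
```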

Hey, thanks.
I would love to help with fp8 matmul if you need it.

I forgot to mention it here, but I've had fp8 matmul working with the scaled models for a week or so. The downside of using that is the need to merge LoRAs, as I couldn't find a way to use unmerged LoRAs with fp8: a normal multiply operation isn't supported in fp8, only matrix multiplication.
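Roughly, the merge looks like this (illustrative names, per-tensor scale, alpha/rank scaling folded into strength for brevity; not the wrapper's actual code):

```python
import torch

def merge_lora_into_scaled_fp8(w_fp8, w_scale, lora_up, lora_down, strength):
    # Dequantize the base weight to high precision, add the LoRA delta
    # (up @ down), then requantize with a fresh per-tensor scale. After this
    # the fp8 GEMM only ever sees one weight tensor, which is why the LoRA
    # strength can no longer be changed without redoing the merge.
    # Shapes: lora_up (out_features, rank), lora_down (rank, in_features).
    w = w_fp8.to(torch.float32) * w_scale
    w = w + strength * (lora_up.to(torch.float32) @ lora_down.to(torch.float32))
    new_scale = w.abs().amax() / torch.finfo(torch.float8_e4m3fn).max
    w_fp8_new = (w / new_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
    return w_fp8_new, new_scale
```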

What is the upside of unmerged LoRAs anyway?

What is the upside of unmerged LoRAs anyway?

  • LoRA loading and switching is instant
  • Allows changing LoRA weight on the fly; in the wrapper you can give a list of floats as LoRA strengths, with a different value for each step
  • LoRAs work a bit better, as they can be used at their full weight instead of being downcast to fp8 (see the sketch after this post)

GGUF also has to run without merging, with both ComfyUI-GGUF nodes and the wrapper.
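For contrast, the unmerged path is basically the following (an illustrative sketch with an fp16 base weight, since as noted above this doesn't work with the fp8 GEMM):

```python
import torch.nn.functional as F

def linear_with_unmerged_lora(x, w_base, lora_up, lora_down, strength):
    # The base projection and the low-rank path stay separate, so the LoRA
    # strength is just a scalar applied at runtime: it can differ per step,
    # switching LoRAs means swapping two small tensors, and the LoRA weights
    # stay at full precision instead of being folded into the base weight.
    base = F.linear(x, w_base)
    lora = F.linear(F.linear(x, lora_down), lora_up)
    return base + strength * lora
```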
