How to use fp8 mat-mul?

#7
by Muawiz - opened

FP8-scaled weights are usually converted back to the base precision for inference. How can we utilize fp8 matmul directly, so we benefit from GPUs that support it, like the H100?
Currently I am getting:
ValueError: The model is a scaled fp8 model, please set quantization to '_scaled'
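For reference, "fp8 matmul" here means running the Linear projections through the GPU's fp8 GEMM path instead of upcasting the weights first. A rough sketch of that, assuming PyTorch 2.4+ (where the private torch._scaled_mm op takes the scales and returns a single tensor), per-tensor scales, and feature dimensions that are multiples of 16; this is illustrative only, not what the wrapper actually does:

```python
import torch

def fp8_linear(x, w_fp8, w_scale):
    # x: 2D fp16/bf16 activations (tokens, in_features), row-major.
    # w_fp8: (out_features, in_features) stored as float8_e4m3fn.
    # w_scale: float32 scalar tensor (per-tensor weight scale).
    # Quantize the activation on the fly, then run the matmul itself in fp8
    # on the H100's fp8 tensor cores via the (private) torch._scaled_mm op.
    x_scale = (x.abs().amax() / torch.finfo(torch.float8_e4m3fn).max).float()
    x_fp8 = (x / x_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
    # _scaled_mm wants the second operand column-major, which .t() of the
    # row-major weight provides; feature dims must be multiples of 16.
    return torch._scaled_mm(
        x_fp8, w_fp8.t(),
        scale_a=x_scale, scale_b=w_scale,
        out_dtype=x.dtype,
    )
```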

WanVideo Model Loader -> quantization.
If it's not there, update the wrapper nodes.

[Screenshot: WanVideo2_2_I2V_00100 workflow in ComfyUI]

Do any of these change the precision used for the inference itself?
I guess using FP8 for the inference would save a bit of VRAM, but degrade quality a lot...?

These fp8 quantizations are just size-based, reducing the safetensors file size.
I wanted to try fp8 inference, but all I got was a black screen when I quantized the weights to fp8 using optimum-quanto and ran inference in fp8.
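For context, that kind of size-based fp8 is weight-only: the tensor is stored in float8 with a scale and gets upcast to the activation dtype right before the matmul, so the compute itself still runs in fp16/bf16. A minimal sketch with illustrative names, assuming a per-tensor scale:

```python
import torch
import torch.nn.functional as F

def weight_only_fp8_linear(x, w_fp8, w_scale, bias=None):
    # "Size-based" fp8: the weight lives in float8_e4m3fn in VRAM, but is
    # dequantized to the activation dtype before F.linear, so the matmul
    # itself (and its quality) is still fp16/bf16.
    w = w_fp8.to(x.dtype) * w_scale.to(x.dtype)
    return F.linear(x, w, bias)

# Producing the fp8 weight and scale (illustrative; 448 is the e4m3 max):
w = torch.randn(4096, 4096, dtype=torch.float16)
w_scale = w.abs().amax().float() / torch.finfo(torch.float8_e4m3fn).max
w_fp8 = (w / w_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
```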

I have not implemented fp8 matmul with the scaled models yet. In 2.1, fp8 matmul had a huge quality impact, so it wasn't worth using; it's better in 2.2, so I need to look at this when I get the time.

Also, I should note that fp16_fast, aka full fp16 accumulation, already makes the Linear ops ~20% faster, so fp8 matmul, while faster, isn't that huge an improvement over it as long as it's feasible to use fp16 as the base precision.
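For anyone who wants to try the same behavior outside the wrapper: recent PyTorch builds expose full fp16 accumulation as a global matmul flag. A hedged sketch (the flag name is my assumption of the current spelling and it is absent from older versions, hence the guard):

```python
import torch

# "fp16_fast" style speedup: let fp16 matmuls accumulate in fp16 instead of
# fp32. The flag below only exists in recent PyTorch builds; on older
# versions this sketch simply does nothing.
if hasattr(torch.backends.cuda.matmul, "allow_fp16_accumulation"):
    torch.backends.cuda.matmul.allow_fp16_accumulation = True
```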

Hey, thanks.
I would love to help with fp8 matmul if you need it.

I forgot to mention it here, but I've had fp8 matmul working with the scaled models for a week or so. The downside of using that is the need to merge LoRAs, as I couldn't find a way to use unmerged LoRAs with fp8: a normal multiply operation isn't supported in fp8, only matrix multiplication.
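Roughly, the merge looks like this (illustrative names, per-tensor scale, alpha/rank scaling folded into strength for brevity; not the wrapper's actual code):

```python
import torch

def merge_lora_into_scaled_fp8(w_fp8, w_scale, lora_up, lora_down, strength):
    # Dequantize the base weight to high precision, add the LoRA delta
    # (up @ down), then requantize with a fresh per-tensor scale. After this
    # the fp8 GEMM only ever sees one weight tensor, which is why the LoRA
    # strength can no longer be changed without redoing the merge.
    # Shapes: lora_up (out_features, rank), lora_down (rank, in_features).
    w = w_fp8.to(torch.float32) * w_scale
    w = w + strength * (lora_up.to(torch.float32) @ lora_down.to(torch.float32))
    new_scale = w.abs().amax() / torch.finfo(torch.float8_e4m3fn).max
    w_fp8_new = (w / new_scale).clamp(-448, 448).to(torch.float8_e4m3fn)
    return w_fp8_new, new_scale
```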

What is the upside of unmerged LoRAs anyway?

What is the upside of unmerged LoRAs anyway?

  • LoRA loading and switching is instant
  • Allows changing LoRA weight on the fly; in the wrapper you can give a list of floats as LoRA strengths, with a different value for each step
  • LoRAs work a bit better, as they can be used at their full weight instead of being downcast to fp8 (see the sketch after this post)

GGUF also has to run without merging, with both ComfyUI-GGUF nodes and the wrapper.
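For contrast, the unmerged path is basically the following (an illustrative sketch with an fp16 base weight, since as noted above this doesn't work with the fp8 GEMM):

```python
import torch.nn.functional as F

def linear_with_unmerged_lora(x, w_base, lora_up, lora_down, strength):
    # The base projection and the low-rank path stay separate, so the LoRA
    # strength is just a scalar applied at runtime: it can differ per step,
    # switching LoRAs means swapping two small tensors, and the LoRA weights
    # stay at full precision instead of being folded into the base weight.
    base = F.linear(x, w_base)
    lora = F.linear(F.linear(x, lora_down), lora_up)
    return base + strength * lora
```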
