Very slow on T4 instance
I just tried this fp8 model on a T4 instance. It loads, but training runs very, very slowly.
steps: 1%|β | 7/800 [03:17<6:11:57, 28.14s/it, avr_loss=0.305]
Is that normal?
The T4 doesn't support bf16. You need bf16 (or bf16 mixed precision) here because fp16 produces NaNs, but if you set bf16 on a T4 it has to convert to fp32 every time it does a calculation, which is why it's so slow. Use an L4, which supports bf16.
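You can check this directly from PyTorch (a quick sketch, assuming CUDA is available and a reasonably recent PyTorch):

```python
import torch

# On a T4 this prints False, so bf16 ops fall back to fp32 and every step
# gets much slower; on an L4/L40S/A100 it prints True and bf16 runs on the
# tensor cores directly.
print(torch.cuda.get_device_name(0))
print(torch.cuda.is_bf16_supported())
```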
Thanks, the fp8 model worked on an L4; the ETA is 50 minutes this time.
@rockerBOO I did another test on an L40S, and the fp8 and fp16 models have similar completion times, 17 min vs 18 min. Is that normal? Should I expect a performance boost from the fp8 version?
It depends on whether you are using mixed precision. Usually you'd be coming from fp32 and using mixed precision to do the calculations in bf16 or fp16, which is where the performance increase comes from. But with fp8 weights and the calculations done in bf16, the math is still running at the higher precision, so fp8 mainly saves memory. You'd need to do mixed precision at fp8 to see a compute speedup, which is a little more involved and requires third-party libraries.
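To illustrate what I mean (a rough sketch of my own, not the trainer's actual code): fp8 here is just a storage format, and the matmul still runs in bf16, so the step time looks about the same as a plain bf16/fp16 run.

```python
import torch

# Sketch only: fp8 storage with bf16 compute. The weight lives in fp8 to save
# VRAM, but it gets upcast to bf16 right before the matmul, so the GEMM itself
# runs on the same bf16 tensor cores as a normal bf16/fp16 run.
weight_fp8 = torch.randn(4096, 4096, device="cuda").to(torch.float8_e4m3fn)
x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")

y = x @ weight_fp8.to(torch.bfloat16).t()   # upcast happens here, every call
print(y.dtype)  # torch.bfloat16

# Actual fp8 compute (e.g. TransformerEngine or torchao's float8 training)
# is a separate, more involved setup and is what would give a real speedup.
```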
@rockerBOO Have you tried to run flux fp8 on comfyui?
On an L40 I tried to run ComfyUI with the Flux fp8 version, but it still tries to cast it to fp16. Any ideas?
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
I already added --fp8_e4m3fn-unet to the ComfyUI CLI args, but it still tries to cast it.
Well, it's casting it to bfloat16, so that's fine. The cast only happens when it does the calculations, since doing the calculations themselves in fp8 can be problematic unless you have other libraries that support it better.
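If it helps, here's roughly what that "manual cast" message amounts to (my own simplified sketch, not ComfyUI's code; `ManualCastLinear` is just an illustrative name): the weight stays in fp8 in VRAM, and each layer upcasts a bf16 copy only at compute time.

```python
import torch
import torch.nn as nn

class ManualCastLinear(nn.Linear):
    # Simplified idea of the manual-cast pattern: keep the weight stored in
    # fp8, upcast to bf16 just-in-time for the actual calculation.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight.to(torch.bfloat16)
        b = self.bias.to(torch.bfloat16) if self.bias is not None else None
        return nn.functional.linear(x.to(torch.bfloat16), w, b)

layer = ManualCastLinear(1024, 1024, bias=False)
layer.weight.data = layer.weight.data.to(torch.float8_e4m3fn)  # fp8 in memory

with torch.no_grad():                     # inference only
    out = layer(torch.randn(2, 1024))     # the math still happens in bf16
print(out.dtype)  # torch.bfloat16
```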