Why not FP8 with static and per-tensor quantization?
Thanks a lot. I found that config.json and recipe.yaml show dynamic FP8 quantization. I have the following questions:
- Why not static and per-tensor?
- The ignore list in recipe.yaml is shown below. Why are the 're:.*self_attn', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj', and 're:.*feed_forward.down_proj' layers ignored?
ignore: ['re:.*lm_head', 're:.*self_attn', 're:.*router', 're:.*vision_model', 're:.*multi_modal_projector',
        're:.*shared_expert', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj',
        're:.*feed_forward.down_proj']
Could you share the code that uses the llmcompressor toolkit to produce the FP8 dynamic/static model?
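For readers trying to see which layers the ignore list above actually excludes, here is a minimal sketch using only Python's standard library. It assumes that an entry prefixed with `re:` is treated as a regular expression applied from the start of the module name, and that other entries are exact names. The module names are hypothetical and only illustrate the pattern; the real names depend on how the checkpoint is loaded.

```python
import re

# Ignore list copied from recipe.yaml in the question above.
IGNORE = [
    "re:.*lm_head", "re:.*self_attn", "re:.*router", "re:.*vision_model",
    "re:.*multi_modal_projector", "re:.*shared_expert",
    "re:.*feed_forward.gate_proj", "re:.*feed_forward.up_proj",
    "re:.*feed_forward.down_proj",
]

def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name matches any ignore entry.

    Assumption: 're:'-prefixed entries are regexes matched from the start
    of the name; anything else must equal the name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False

# Hypothetical module names, for illustration only.
names = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.model.layers.0.feed_forward.gate_proj",
    "language_model.model.layers.1.feed_forward.shared_expert.up_proj",
    "language_model.model.layers.1.feed_forward.experts.3.gate_proj",
]
for name in names:
    print(name, "-> ignored" if is_ignored(name) else "-> quantized")
```

Under these assumptions, the attention projections, the dense feed_forward projections, and the shared expert are all skipped, and only names under a routed-expert path stay eligible for quantization.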
We are following the standard set by https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 for now. I do think per-channel and per-token are needed to preserve accuracy for this model. We are exploring more aggressive quantization ablations right now, but this is what we wanted to push first.
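To make the per-tensor versus per-channel/per-token distinction concrete, here is a small pure-Python sketch of how the scales differ. It is not the llmcompressor implementation. It assumes symmetric scaling to the FP8 E4M3 maximum of 448, and the weight values and function names are made up for illustration.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def per_tensor_scale(weight):
    """One scale for the whole matrix, set by its single largest value."""
    return max(abs(v) for row in weight for v in row) / FP8_E4M3_MAX

def per_channel_scales(weight):
    """One scale per output channel (row), set by that row's largest value."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in weight]

def per_token_scales(activations):
    """Dynamic activation scales: one per token (row), computed at run time."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in activations]

# Illustrative weights: row 0 is small, row 1 contains a large outlier.
weight = [
    [0.1, -0.2, 0.05],
    [8.0, -4.0, 2.0],
]

print("per-tensor scale:  ", per_tensor_scale(weight))
print("per-channel scales:", per_channel_scales(weight))
```

With a single per-tensor scale, the outlier in row 1 dictates the step size for row 0 as well, so the small row uses only a narrow slice of the FP8 range. Per-channel scales let each row use the full range, and per-token scales do the same for activations without needing calibration data.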
Were you able to successfully compile this build yet, and was it nominal to say the least?

