This model was built with Meta Llama by applying AMD-Quark for MXFP4 quantization.
It was obtained by quantizing the weights and activations of Llama-3.3-70B-Instruct to MXFP4 and its KV cache to FP8, using the AutoSmoothQuant algorithm in AMD-Quark.
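For intuition only (this is not AMD-Quark's implementation), the sketch below simulates MX-style block quantization as described by the OCP Microscaling format: each group of 32 values shares one power-of-two (E8M0) scale, and every element is rounded to the nearest 4-bit E2M1 value. All names here are illustrative; the actual quantization is performed by the Quark script shown next.

import math

# Non-negative magnitudes representable by a 4-bit E2M1 element.
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def mxfp4_fake_quantize(block):
    """Simulate MXFP4 quantization of one 32-element block (values stay in float)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Shared power-of-two (E8M0) scale; 2 is the largest exponent of an E2M1 element.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    quantized = []
    for x in block:
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - abs(x) / scale))
        quantized.append(math.copysign(nearest * scale, x))
    return quantized

block = [0.013 * i - 0.2 for i in range(32)]  # toy data: one group of 32 values
print(mxfp4_fake_quantize(block))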
Quantization script:
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py --model_dir meta-llama/Llama-3.3-70B-Instruct \
--quant_scheme w_mxfp4_a_mxfp4 \
--group_size 32 \
--kv_cache_dtype fp8 \
--num_calib_data 128 \
--multi_gpu \
--quant_algo autosmoothquant \
--model_export hf_format \
--output_dir amd/Llama-3.3-70B-Instruct-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-ASQ
This model can be deployed efficiently using the vLLM backend.
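As a minimal sketch (not an official deployment recipe), the model can be loaded with vLLM's offline Python API. The model ID and the FP8 KV-cache setting mirror the evaluation commands below; the prompt and sampling parameters are purely illustrative.

from vllm import LLM, SamplingParams

# Load the quantized checkpoint; kv_cache_dtype="fp8" matches the quantized KV cache.
llm = LLM(
    model="amd/Llama-3.3-70B-Instruct-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-ASQ",
    kv_cache_dtype="fp8",
    tensor_parallel_size=1,
    max_model_len=4096,
)

# Illustrative prompt and sampling settings.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)
outputs = llm.generate(["What is MXFP4 quantization?"], sampling_params)
print(outputs[0].outputs[0].text)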
The model was evaluated on MMLU and GSM8K_COT. The evaluation was conducted with the lm-evaluation-harness framework and the vLLM engine.
| Benchmark | Llama-3.3-70B-Instruct | Llama-3.3-70B-Instruct-MXFP4 (this model) | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 83.36 | 81.43 | 97.68% |
| GSM8K_COT (8-shot, strict-match) | 94.54 | 94.24 | 99.68% |
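Recovery is the quantized model's score expressed as a percentage of the baseline score; the small sketch below (names are illustrative) reproduces the figures in the table.

# Recovery = quantized score / baseline score, as a percentage.
def recovery(quantized: float, baseline: float) -> float:
    return 100.0 * quantized / baseline

print(f"MMLU recovery: {recovery(81.43, 83.36):.2f}%")        # 97.68%
print(f"GSM8K_COT recovery: {recovery(94.24, 94.54):.2f}%")   # 99.68%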
The results were obtained using the following commands:
lm_eval \
--model vllm \
--model_args pretrained=amd/Llama-3.3-70B-Instruct-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-ASQ,dtype=auto,max_gen_toks=10,add_bos_token=True,tensor_parallel_size=1,gpu_memory_utilization=0.8,max_model_len=4096,kv_cache_dtype=fp8 \
--tasks mmlu_llama \
--apply_chat_template \
--fewshot_as_multiturn \
--num_fewshot 5 \
--batch_size 32 \
--device cuda
lm_eval \
--model_args pretrained=amd/Llama-3.3-70B-Instruct-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-ASQ,dtype=auto,add_bos_token=True,tensor_parallel_size=1,gpu_memory_utilization=0.8,max_model_len=4096,kv_cache_dtype=fp8 \
--model vllm \
--tasks gsm8k_llama \
--apply_chat_template \
--fewshot_as_multiturn \
--num_fewshot 8 \
--batch_size 64 \
--device cuda
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.