Quark Quantized MXFP4 models
This model was built with Meta Llama by applying AMD Quark for MXFP4 quantization. The calibration dataset consists of 1,000 processed samples provided by mlcommons/inference.
The following tensors are quantized in each decoder:

- Linear weights and input activations: MXFP4 (group size 32)
- KV cache: FP8

The following layers are ignored during quantization:

- lm_head
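The MXFP4 layout can be illustrated with a minimal numpy sketch. This is an illustrative approximation of the OCP MX scheme, not Quark's actual implementation: each block of 32 values shares a single power-of-two (E8M0) scale, and each element is rounded to the nearest FP4 (E2M1) value.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1, the MXFP4 element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(x: np.ndarray) -> np.ndarray:
    """Fake-quantize one 32-element block to MXFP4 and back.

    Sketch only: the block shares one power-of-two scale derived from
    its max magnitude, and each element snaps to the nearest FP4 value.
    """
    assert x.size == 32, "MXFP4 uses blocks of 32 elements"
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x)
    # Shared scale: 2^(floor(log2(amax)) - emax), with emax = 2 for E2M1.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(np.abs(x) / scale, 0.0, 6.0)  # 6.0 = FP4 max value
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

x = np.random.randn(32).astype(np.float64)
xq = mxfp4_quant_dequant(x)
```

Because the scale is a pure power of two, dequantization is a cheap exponent shift, which is what makes block formats like MXFP4 hardware-friendly.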
The GPTQ algorithm is applied during weight quantization for better accuracy.
cd examples/torch/language_modeling/llm_ptq/
MODEL_DIR="meta-llama/Llama-2-70b-chat-hf"
OUTPUT_DIR="amd/Llama-2-70b-chat-hf-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-MLPerf-GPTQ"
DATASET="./mlperf_data/open_orca_gpt4_tokenized_llama.calibration_1000.pkl"
python3 quantize_quark.py --model_dir "${MODEL_DIR}" \
--output_dir "${OUTPUT_DIR}" \
--dataset "${DATASET}" \
--model_attn_implementation "sdpa" \
--quant_scheme w_mxfp4_a_mxfp4 \
--group_size 32 \
--kv_cache_dtype fp8 \
--num_calib_data 1000 \
--multi_gpu \
--seq_len 1024 \
--exclude_layers "lm_head" \
--quant_algo gptq \
--model_export hf_format
| Metric | Baseline Accuracy | MXFP4 Accuracy (% of baseline) |
|---|---|---|
| ROUGE-1 | 44.4312 | 44.6401 (100.47%) |
| ROUGE-2 | 22.0352 | 22.2210 (100.84%) |
| ROUGE-L | 28.6162 | 28.9798 (101.27%) |
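The percentages in parentheses are the MXFP4 score expressed relative to the baseline; they can be reproduced directly from the table values:

```python
# ROUGE scores copied from the accuracy table above.
baseline = {"ROUGE-1": 44.4312, "ROUGE-2": 22.0352, "ROUGE-L": 28.6162}
mxfp4 = {"ROUGE-1": 44.6401, "ROUGE-2": 22.2210, "ROUGE-L": 28.9798}

# Relative accuracy = quantized score / baseline score, in percent.
relative = {m: round(mxfp4[m] / baseline[m] * 100, 2) for m in baseline}
for metric, pct in relative.items():
    print(f"{metric}: {pct}%")
```

All three metrics land at or slightly above 100%, i.e. the MXFP4 model matches the baseline on this benchmark within noise.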
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Base model: meta-llama/Llama-2-70b-chat-hf