Quark Team MXFP4 Llama-2-70b Model Overview

Model Information For MLPerf

  • Model Name: meta-llama/Llama-2-70b-chat-hf
  • Version: MLPerf v5.1
  • Commit: Closed Division Commit
  • Supported Hardware Microarchitecture: AMD MI350/MI355
  • Operating System: Linux
  • ROCm: 7.0
  • vLLM: 0.8.5
  • Transformers: 4.51.0
  • Quark: 0.9

Calibration Dataset

This model was built with Meta Llama 2 by applying AMD Quark for MXFP4 quantization. The calibration dataset consists of 1,000 processed samples provided by mlcommons/inference.
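
For reference, the calibration file can be inspected directly. A minimal sketch, assuming the pickle holds a pandas DataFrame of tokenized OpenOrca samples as in the MLPerf reference preprocessing (the printed fields are illustrative and may differ by version):

import pandas as pd

# Assumption: the MLPerf calibration file unpickles to a pandas DataFrame.
df = pd.read_pickle("./mlperf_data/open_orca_gpt4_tokenized_llama.calibration_1000.pkl")
print(len(df))               # expected: 1000 calibration samples
print(df.columns.tolist())   # inspect available fields (e.g., tokenized input)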

Quantized Tensors

The following tensors are quantized in each decoder layer:

  • Weights: OCP MXFP4, Static
  • Activations: OCP MXFP4, Dynamic
  • KV Cache Entries: OCP FP8, Static
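
OCP MX formats store each 32-element block as FP4 (E2M1) values sharing one power-of-two (E8M0) scale. Below is a simplified reference quantizer for a single block, a sketch of the format rather than Quark's implementation; the scale rule follows the OCP MX spec (shared exponent = floor(log2(amax)) minus the FP4 element's maximum exponent, which is 2):

import numpy as np

# Representable non-negative FP4 (E2M1) magnitudes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Quantize one 32-element block: shared E8M0 scale + FP4 elements.
    Simplified sketch of the OCP MXFP4 format, not Quark's kernel."""
    assert block.size == 32
    amax = np.abs(block).max()
    # Shared scale: 2**(floor(log2(amax)) - 2), with emax = 2 for E2M1.
    exp = int(np.floor(np.log2(amax))) - 2 if amax > 0 else 0
    scale = 2.0 ** exp
    scaled = block / scale
    # Round each element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return exp, q  # dequantize as q * 2.0**exp

x = np.random.randn(32).astype(np.float32)
exp, q = quantize_mxfp4_block(x)
x_hat = q * 2.0 ** exp  # MXFP4 round-trip of the block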

Ignored Layers

The following layers are ignored during quantization:

  • lm_head

Algorithms

The GPTQ algorithm is applied during weight quantization to improve the accuracy of the quantized model.
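
Concretely, GPTQ chooses quantized weights that minimize the layer-wise reconstruction error on the calibration activations, updating the not-yet-quantized columns to compensate after each rounding step:

$$\hat{W} \;=\; \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_F^2$$

where $W$ are the original layer weights, $X$ the calibration activations, and $\hat{W}$ is constrained to the MXFP4 grid.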

Quantization Scripts

cd examples/torch/language_modeling/llm_ptq/
MODEL_DIR="meta-llama/Llama-2-70b-chat-hf"
OUTPUT_DIR="amd/Llama-2-70b-chat-hf-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-MLPerf-GPTQ"
DATASET="./mlperf_data/open_orca_gpt4_tokenized_llama.calibration_1000.pkl"

python3 quantize_quark.py --model_dir "${MODEL_DIR}" \
                          --output_dir "${OUTPUT_DIR}" \
                          --dataset "${DATASET}" \
                          --model_attn_implementation "sdpa" \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --group_size 32 \
                          --kv_cache_dtype fp8 \
                          --num_calib_data 1000 \
                          --multi_gpu \
                          --seq_len 1024 \
                          --exclude_layers "lm_head" \
                          --quant_algo gptq \
                          --model_export hf_format
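
After export, the HF-format checkpoint can be served with vLLM. A hedged sketch (the flags shown are standard vLLM options, but executing MXFP4 kernels requires a ROCm build of vLLM with Quark/MXFP4 support; tensor_parallel_size is illustrative):

# Hypothetical serving sketch; requires a ROCm vLLM build with MXFP4 support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="amd/Llama-2-70b-chat-hf-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-MLPerf-GPTQ",
    kv_cache_dtype="fp8",        # matches the FP8 KV cache quantization above
    tensor_parallel_size=8,      # illustrative; size to your MI350/MI355 node
)
out = llm.generate(["What is MXFP4?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)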

Model Performance Comparison

Metric     Baseline Accuracy    MXFP4 Accuracy (Recovery %)
ROUGE-1    44.4312              44.6401 (100.47%)
ROUGE-2    22.0352              22.2210 (100.84%)
ROUGE-L    28.6162              28.9798 (101.27%)
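
The recovery percentages in the last column are simply the MXFP4 score divided by the baseline score:

baseline = {"ROUGE-1": 44.4312, "ROUGE-2": 22.0352, "ROUGE-L": 28.6162}
mxfp4    = {"ROUGE-1": 44.6401, "ROUGE-2": 22.2210, "ROUGE-L": 28.9798}
for metric, base in baseline.items():
    print(f"{metric}: {100 * mxfp4[metric] / base:.2f}% recovery")
# ROUGE-1: 100.47%, ROUGE-2: 100.84%, ROUGE-L: 101.27%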

License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
