Quark Quantized MXFP4 models
This model was built with Meta Llama by applying AMD Quark for MXFP4 quantization. The calibration dataset consists of 1,000 processed samples provided by mlcommons/inference.
The following tensors are quantized in each decoder:

- Linear weights and input activations: MXFP4 (group size 32)
- KV cache: FP8

The following layers are ignored during quantization:

- lm_head
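The MXFP4 layout can be illustrated with a minimal numpy sketch. This is an illustrative approximation of the OCP MX scheme, not Quark's actual implementation: each block of 32 values shares a single power-of-two (E8M0) scale, and each element is rounded to the nearest FP4 (E2M1) value.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1, the MXFP4 element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(x: np.ndarray) -> np.ndarray:
    """Fake-quantize one 32-element block to MXFP4 and back.

    Sketch only: the block shares one power-of-two scale derived from
    its max magnitude, and each element snaps to the nearest FP4 value.
    """
    assert x.size == 32, "MXFP4 uses blocks of 32 elements"
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x)
    # Shared scale: 2^(floor(log2(amax)) - emax), with emax = 2 for E2M1.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(np.abs(x) / scale, 0.0, 6.0)  # 6.0 = FP4 max value
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

x = np.random.randn(32).astype(np.float64)
xq = mxfp4_quant_dequant(x)
```

Because the scale is a pure power of two, dequantization is a cheap exponent shift, which is what makes block formats like MXFP4 hardware-friendly.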
The GPTQ algorithm is applied during weight quantization for better accuracy.
cd examples/torch/language_modeling/llm_ptq/
MODEL_DIR="meta-llama/Llama-2-70b-chat-hf"
OUTPUT_DIR="amd/Llama-2-70b-chat-hf-WMXFP4-AMXFP4-KVFP8-Scale-UINT8-MLPerf-GPTQ"
DATASET="./mlperf_data/open_orca_gpt4_tokenized_llama.calibration_1000.pkl"
python3 quantize_quark.py --model_dir "${MODEL_DIR}" \
--output_dir "${OUTPUT_DIR}" \
--dataset "${DATASET}" \
--model_attn_implementation "sdpa" \
--quant_scheme w_mxfp4_a_mxfp4 \
--group_size 32 \
--kv_cache_dtype fp8 \
--num_calib_data 1000 \
--multi_gpu \
--seq_len 1024 \
--exclude_layers "lm_head" \
--quant_algo gptq \
--model_export hf_format
| Metric | Baseline Accuracy | MXFP4 Accuracy (% of baseline) |
|---|---|---|
| ROUGE-1 | 44.4312 | 44.6401 (100.47%) |
| ROUGE-2 | 22.0352 | 22.2210 (100.84%) |
| ROUGE-L | 28.6162 | 28.9798 (101.27%) |
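The percentages in parentheses are the MXFP4 score expressed relative to the baseline; they can be reproduced directly from the table values:

```python
# ROUGE scores copied from the accuracy table above.
baseline = {"ROUGE-1": 44.4312, "ROUGE-2": 22.0352, "ROUGE-L": 28.6162}
mxfp4 = {"ROUGE-1": 44.6401, "ROUGE-2": 22.2210, "ROUGE-L": 28.9798}

# Relative accuracy = quantized score / baseline score, in percent.
relative = {m: round(mxfp4[m] / baseline[m] * 100, 2) for m in baseline}
for metric, pct in relative.items():
    print(f"{metric}: {pct}%")
```

All three metrics land at or slightly above 100%, i.e. the MXFP4 model matches the baseline on this benchmark within noise.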
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Base model: meta-llama/Llama-2-70b-chat-hf