Quark Team FP8 Llama-2-70b-chat-hf Model Overview

Model Information For MLPerf

  • Model Name: Llama-2-70b-chat-hf
  • Version: MLPerf v5.0
  • Commit: Closed Division Commit

Calibration Dataset

The calibration dataset consists of 1000 OpenOrca samples provided by MLPerf.

Quantized Tensors

The following tensors are quantized in each decoder:

  • MLP Layer inputs and weights
  • Linear (including QKVO linear) layer inputs and weights
  • KV Cache Entries
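
As a rough illustration of what quantizing a tensor to FP8 involves, the sketch below simulates scaled rounding to the E4M3 format (3 mantissa bits, largest finite value 448) in pure Python. It is not the Quark implementation: per-tensor scaling from the absolute maximum is an assumption made for the example, and the helper names `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical.

```python
import math

E4M3_MAX = 448.0     # largest finite value in the E4M3 format
E4M3_MIN_EXP = -6    # exponent of the smallest normal value
MANTISSA_BITS = 3

def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)              # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, E4M3_MIN_EXP)      # subnormals share the minimum exponent
    step = 2.0 ** (e - MANTISSA_BITS)  # spacing of representable values near a
    # Python's round() breaks ties to even, like IEEE round-to-nearest-even.
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantize_per_tensor(values):
    """Assumed scheme: one scale per tensor, chosen so the max maps to 448."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale

def dequantize(q_values, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in q_values]

weights = [0.5, -2.0, 1.0, 0.013]
q, scale = quantize_per_tensor(weights)
restored = dequantize(q, scale)
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Real FP8 kernels store the 8-bit codes and the scale separately; the sketch keeps the rounded values as Python floats only to make the rounding error visible.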

Ignored Layers

The following layers are ignored during quantization:

  • lm_head
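
The effect of an ignore list can be sketched as a simple name filter. The example below is a hypothetical helper, not Quark's API: it splits a list of module names into those that would be quantized and those that are skipped. The module names follow the usual Hugging Face Llama naming and are shown for a single decoder layer only.

```python
from fnmatch import fnmatch

# Patterns for layers left in their original precision (from the list above).
IGNORED_PATTERNS = ["lm_head"]

def split_layers(layer_names, ignored_patterns=IGNORED_PATTERNS):
    """Return (quantized, skipped) lists of layer names."""
    quantized, skipped = [], []
    for name in layer_names:
        if any(fnmatch(name, pattern) for pattern in ignored_patterns):
            skipped.append(name)
        else:
            quantized.append(name)
    return quantized, skipped

# Illustrative names for one decoder layer plus the output head.
example_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

quantized, skipped = split_layers(example_names)
print("quantized:", quantized)
print("skipped:", skipped)
```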

Model Performance Comparison

Metric              Baseline Accuracy Target (%)    FP8 Quant Accuracy (%)
Open Orca (Chat)
- Rouge1            44.4312                         44.6369
- Rouge2            22.0352                         22.1798
- RougeL            28.6162                         28.8249
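
To make the comparison explicit, the short script below computes the absolute and relative difference between each FP8 score and its baseline target, using only the numbers reported above. The `compare` helper is hypothetical and the check is a plain "at or above target" test, not the official MLPerf accuracy checker.

```python
# (baseline accuracy target, FP8 quantized accuracy) per metric, from the table.
RESULTS = {
    "Rouge1": (44.4312, 44.6369),
    "Rouge2": (22.0352, 22.1798),
    "RougeL": (28.6162, 28.8249),
}

def compare(results):
    """Return per-metric delta, relative change in percent, and a pass flag."""
    report = {}
    for metric, (target, fp8) in results.items():
        report[metric] = {
            "delta": fp8 - target,
            "relative_pct": 100.0 * (fp8 - target) / target,
            "meets_target": fp8 >= target,
        }
    return report

report = compare(RESULTS)
for metric, row in report.items():
    print(
        f"{metric}: {row['delta']:+.4f} "
        f"({row['relative_pct']:+.2f}%), meets target: {row['meets_target']}"
    )
```

All three FP8 scores come out slightly above their targets, by well under one percent in relative terms.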

License

Modifications Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.

Model Files

  • Repository: amd/Llama-2-70b-chat-hf_FP8_MLPerf_V2
  • Format: Safetensors
  • Model size: 69B params
  • Tensor types: FP16, F8_E4M3