---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-405B-Instruct
---

# Model Overview

- **Model Architecture:** Meta-Llama-3.1
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **Preferred Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
  - **Weight quantization:** OCP MXFP4
  - **Activation quantization:** OCP MXFP4
  - **KV cache quantization:** OCP FP8
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model is the quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct), an auto-regressive language model that uses an optimized transformer architecture. For more information, see the [original model card](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct). The MXFP4 quantization was performed with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).


# Model Quantization

This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to MXFP4 and its KV cache to FP8, using the AutoSmoothQuant algorithm in [AMD-Quark](https://quark.docs.amd.com/latest/index.html).

**Quantization script:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py --model_dir "meta-llama/Meta-Llama-3.1-405B-Instruct" \
                          --model_attn_implementation "sdpa" \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --kv_cache_dtype fp8 \
                          --quant_algo autosmoothquant \
                          --min_kv_scale 1.0 \
                          --model_export hf_format \
                          --output_dir $output_path \
                          --multi_gpu
```

# Deployment
## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
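
As a minimal sketch, the model can be served through vLLM's OpenAI-compatible server; the tensor-parallel size, KV-cache dtype, and memory utilization below mirror the evaluation settings further down and should be adjusted to your hardware:

```
vllm serve amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85
```

Once the server is running, requests can be sent to the OpenAI-compatible endpoint, for example:

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "amd/Llama-3.1-405B-Instruct-MXFP4-Preview",
        "messages": [{"role": "user", "content": "What is MXFP4 quantization?"}]
    }'
```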

# Evaluation

The model was evaluated on the MMLU and GSM8K_COT benchmarks.
Evaluation was conducted using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework with the vLLM engine.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>Meta-Llama-3.1-405B-Instruct</strong>
   </td>
   <td><strong>Meta-Llama-3.1-405B-Instruct-MXFP4 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>MMLU (5-shot)
   </td>
   <td>87.63
   </td>
   <td>86.62
   </td>
   <td>98.85%
   </td>
  </tr>
  <tr>
   <td>GSM8K_COT (8-shot, strict-match)
   </td>
   <td>96.51
   </td>
   <td>96.06
   </td>
   <td>99.53%
   </td>
  </tr>
</table>


### Reproduction

The results were obtained using the following commands:

#### MMLU
```
lm_eval \
    --model vllm \
    --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks mmlu_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 5 \
    --batch_size auto
```

#### GSM8K_COT
```
lm_eval \
    --model vllm \
    --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks gsm8k_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 8 \
    --batch_size auto
```

# License

Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.