---
library_name: transformers
language:
- en
- fr
- it
- pt
- hi
- es
- th
- de
base_model:
- meta-llama/Llama-3.3-70B-Instruct
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
- int4
- quantized
license: llama3.3
---

25
+ # Llama-3.3-70B-Instruct-quantized.w4a16
26
+
27
+ ## Model Overview
28
+ - **Model Architecture:** Meta-Llama-3.1
29
+ - **Input:** Text
30
+ - **Output:** Text
31
+ - **Model Optimizations:**
32
+ - **Weight quantization:** INT4
33
+ - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.3 model also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.3 Community License allows for these use cases.
34
+ - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.3 Community License. Use in languages beyond English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
35
+ - **Release Date:** 12/11/2024
36
+ - **Version:** 1.0
37
+ - **License(s):** llama3.3
38
+ - **Model Developers:** RedHat (Neural Magic)
39
+
40
+ ### Model Optimizations
41
+
42
+ This model was obtained by quantizing the weights of [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) to INT4 data type.
43
+ This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
44
+
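
As a quick back-of-the-envelope check of that figure (weight storage only; per-group scales and any unquantized tensors add modest overhead on top of this lower bound):

```python
# Approximate weight storage for a 70B-parameter model at 16 vs. 4 bits.
params = 70e9

bf16_gb = params * 16 / 8 / 1e9  # 16 bits per parameter -> ~140 GB
int4_gb = params * 4 / 8 / 1e9   #  4 bits per parameter -> ~35 GB

print(f"BF16: ~{bf16_gb:.0f} GB, INT4: ~{int4_gb:.0f} GB")
print(f"Reduction: {1 - int4_gb / bf16_gb:.0%}")  # 75%
```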

Only the weights of the linear operators within transformer blocks are quantized.
Weights are quantized using a symmetric per-group scheme, with group size 128.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
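
To make the per-group scheme concrete, the sketch below shows plain symmetric round-to-nearest quantization of one weight row; GPTQ additionally compensates rounding error using second-order information, which this illustration omits:

```python
import torch

def quantize_symmetric_per_group(w: torch.Tensor, group_size: int = 128, bits: int = 4):
    """Round-to-nearest symmetric quantization of a 1-D weight row.

    One scale is stored per group of `group_size` weights; INT4 levels
    span [-8, 7], so the per-group overhead is a single scale value.
    """
    qmax = 2 ** (bits - 1) - 1                           # 7 for INT4
    groups = w.reshape(-1, group_size)                   # one row per group
    scales = groups.abs().amax(dim=1, keepdim=True) / qmax
    scales = scales.clamp(min=1e-12)                     # avoid division by zero
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax)
    return q.to(torch.int8), scales                      # dequantize as q * scales

row = torch.randn(512)
q, scales = quantize_symmetric_per_group(row)
w_hat = (q.float() * scales).flatten()
print(f"max abs quantization error: {(w_hat - row).abs().max():.4f}")
```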

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
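
For example, an OpenAI-compatible server can be started with the `vllm serve` CLI and queried over HTTP (a minimal sketch; the default port 8000 and the parallelism flag below are assumptions to adjust for your hardware):

```
vllm serve RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 --tensor-parallel-size 1

# Once the server is up, query it with any OpenAI-compatible client:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",
    "messages": [{"role": "user", "content": "Who are you?"}]
  }'
```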

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load model
model_stub = "meta-llama/Llama-3.3-70B-Instruct"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Render calibration conversations with the model's chat template
def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    sequential_targets=["LlamaDecoderLayer"],
    dampening_frac=0.01,
)

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
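
As a quick sanity check after saving, the quantization metadata written into the checkpoint's `config.json` can be inspected (a sketch; the exact key layout follows the compressed-tensors format and may vary across llm-compressor versions):

```python
import json
import os

# Path produced by the save step above.
save_path = "Llama-3.3-70B-Instruct-quantized.w4a16"

with open(os.path.join(save_path, "config.json")) as f:
    config = json.load(f)

# The quantization metadata lives under "quantization_config".
print(json.dumps(config.get("quantization_config", {}), indent=2))
```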
</details>

## Evaluation

This model was evaluated on the OpenLLM v1 benchmark suite and on the HumanEval and HumanEval+ coding benchmarks.
In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine.

OpenLLM v1 evaluations were conducted using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the prompting style of [Meta-Llama-3.1-Instruct-evals](https://huggingface.co/datasets/meta-llama/Meta-Llama-3.1-8B-Instruct-evals) when available.

HumanEval and HumanEval+ evaluations were conducted using Neural Magic's fork of the [EvalPlus](https://github.com/neuralmagic/evalplus) repository.

<details>
<summary>Evaluation details</summary>

**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3850,max_gen_toks=10,tensor_parallel_size=1 \
  --tasks mmlu_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
```

**MMLU-CoT**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4064,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks mmlu_cot_llama \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

**ARC-Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=3940,max_gen_toks=100,tensor_parallel_size=1 \
  --tasks arc_challenge_llama \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto
```

**GSM-8K**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --batch_size auto
```

**HumanEval and HumanEval+**

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Llama-3.3-70B-Instruct-quantized.w4a16_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Llama-3.3-70B-Instruct-quantized.w4a16_vllm_temp_0.2-sanitized
```
</details>

### Accuracy

<table>
  <tr>
    <th>Category</th>
    <th>Benchmark</th>
    <th>Llama-3.3-70B-Instruct</th>
    <th>Llama-3.3-70B-Instruct-quantized.w4a16<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="8"><strong>OpenLLM v1</strong></td>
    <td>MMLU (5-shot)</td>
    <td>81.60</td>
    <td>80.62</td>
    <td>98.8%</td>
  </tr>
  <tr>
    <td>MMLU (CoT, 0-shot)</td>
    <td>86.58</td>
    <td>85.81</td>
    <td>99.1%</td>
  </tr>
  <tr>
    <td>ARC Challenge (0-shot)</td>
    <td>49.23</td>
    <td>49.49</td>
    <td>100.5%</td>
  </tr>
  <tr>
    <td>GSM-8K (CoT, 8-shot, strict-match)</td>
    <td>94.16</td>
    <td>94.47</td>
    <td>100.3%</td>
  </tr>
  <tr>
    <td>Hellaswag (10-shot)</td>
    <td>86.49</td>
    <td>85.97</td>
    <td>99.4%</td>
  </tr>
  <tr>
    <td>Winogrande (5-shot)</td>
    <td>84.77</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>TruthfulQA (0-shot, mc2)</td>
    <td>62.75</td>
    <td>61.66</td>
    <td>98.3%</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>77.94</strong></td>
    <td><strong>77.49</strong></td>
    <td><strong>99.4%</strong></td>
  </tr>
  <tr>
    <td rowspan="2"><strong>Coding</strong></td>
    <td>HumanEval pass@1</td>
    <td>83.20</td>
    <td>83.40</td>
    <td>100.2%</td>
  </tr>
  <tr>
    <td>HumanEval+ pass@1</td>
    <td>78.40</td>
    <td>78.60</td>
    <td>100.3%</td>
  </tr>
</table>
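
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score. For example, for MMLU (5-shot):

```python
baseline, quantized = 81.60, 80.62
print(f"Recovery: {100 * quantized / baseline:.1f}%")  # 98.8%
```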