---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
base_model:
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
---

# Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic

SmoothQuant/GPTQ W8A8 quantization of https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1

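Since the checkpoint is saved with `save_compressed=True`, it should load in inference stacks that support compressed-tensors W8A8, e.g. vLLM. A minimal inference sketch, with the repo id and parallelism as placeholder assumptions:

```python
# Minimal inference sketch, assuming a vLLM build with compressed-tensors
# W8A8 support. Repo id and tensor_parallel_size are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Ithanil/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic",  # assumed repo id
    tensor_parallel_size=8,   # adjust to your hardware
    trust_remote_code=True,   # the base model ships custom modeling code
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Explain W8A8 quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)
```
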
## Creation

Created with llmcompressor using the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
import random

# Config
MODEL_ID = "/models/Llama-3_1-Nemotron-Ultra-253B-v1"
SAVE_DIR = "/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic"
NUM_CALIBRATION_SAMPLES = 1024
MAX_SEQUENCE_LENGTH = 4096

# Load model
device_map = calculate_offload_device_map(
    MODEL_ID, num_gpus=8, reserve_for_hessians=False, torch_dtype="auto", trust_remote_code=True,
)
print(device_map)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=1337).select(range(NUM_CALIBRATION_SAMPLES))

def add_system_prompt(messages):
    options = ["on", "off"]
    thinking = random.choice(options)
    return [{"content": f"detailed thinking {thinking}", "role": "system"}] + messages

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(add_system_prompt(example["messages"]), tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head", "re:.*125.*", "re:.*134.*", "re:.*143.*", "re:.*149.*"], dampening_frac=0.01, offload_hessians=False),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
)

# Save the compressed model
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

**Note** that layers 125, 134, 143 and 149 had to be **excluded** from GPTQ quantization: due to their extreme size, GPTQ would have allocated Hessian matrices of 600+ GB, which for unknown reasons could not be offloaded. Furthermore, the GPU memory allocation code in `calculate_offload_device_map()` was adjusted.

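The `re:` ignore patterns in the recipe are matched against module names, so their effective scope can be checked directly. A small sketch, assuming the model from the script above is loaded and that llmcompressor applies these regexes to module names via `re.match`:

```python
# Hypothetical check (not part of the original script): list the Linear
# modules that the "re:" ignore patterns above would exclude from GPTQ.
# Assumes the regexes are matched against each module's dotted name.
import re
import torch

ignore_patterns = [r".*125.*", r".*134.*", r".*143.*", r".*149.*"]

for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and any(
        re.match(p, name) for p in ignore_patterns
    ):
        print(name)
```
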
## Evaluation

### GSM8K (3 Runs)

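The tables below follow lm-evaluation-harness' output format. A run of this kind could be launched via its Python API roughly as follows; this is a sketch in which the vllm backend and its model_args are assumptions, while the gsm8k task and the 5-shot setting are implied by the tables:

```python
# Hypothetical reproduction sketch using lm-evaluation-harness (lm_eval >= 0.4).
# Backend and model_args are assumptions; gsm8k at 5-shot matches the tables.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",  # assumed backend
    model_args="pretrained=/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic,tensor_parallel_size=8",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```
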
#### Original
|Tasks        |Version| Filter         |n-shot| Metric    |   |Value |   |Stderr|
|-------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k (run 1)|      3|flexible-extract|     5|exact_match|↑  |0.9469|±  |0.0062|
|             |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|
|gsm8k (run 2)|      3|flexible-extract|     5|exact_match|↑  |0.9424|±  |0.0064|
|             |       |strict-match    |     5|exact_match|↑  |0.9401|±  |0.0065|
|gsm8k (run 3)|      3|flexible-extract|     5|exact_match|↑  |0.9454|±  |0.0063|
|             |       |strict-match    |     5|exact_match|↑  |0.9454|±  |0.0063|
|Avg.         |      3|flexible-extract|     5|exact_match|↑  |0.9449|±  |0.0036|
|             |       |strict-match    |     5|exact_match|↑  |0.9439|±  |0.0037|

#### Quantized
|Tasks        |Version| Filter         |n-shot| Metric    |   |Value |   |Stderr|
|-------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k (run 1)|      3|flexible-extract|     5|exact_match|↑  |0.9431|±  |0.0064|
|             |       |strict-match    |     5|exact_match|↑  |0.9393|±  |0.0066|
|gsm8k (run 2)|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|             |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
|gsm8k (run 3)|      3|flexible-extract|     5|exact_match|↑  |0.9477|±  |0.0061|
|             |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|
|Avg.         |      3|flexible-extract|     5|exact_match|↑  |0.9482|±  |0.0035|
|             |       |strict-match    |     5|exact_match|↑  |0.9452|±  |0.0036|

### simple-evals (10x50 Samples each)

Using a custom fork of OpenAI's simple-evals benchmark suite: https://github.com/Ithanil/simple-evals/tree/custom

These were run using the chat template as well as Nvidia's suggested settings (see the request sketch after this list):
- Reasoning Off: greedy (`temperature=0`), system prompt: `detailed thinking off`
- Reasoning On: `temperature=0.6`, `top_p=0.95`, system prompt: `detailed thinking on`

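As an illustration of these settings, a hypothetical "Reasoning On" request against an OpenAI-compatible server (base URL and served model name are placeholders):

```python
# Hypothetical chat request illustrating the "Reasoning On" settings above.
# Assumes the model is served behind an OpenAI-compatible endpoint;
# base_url and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic",
    messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "Solve: what is 17 * 24?"},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)
```
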
#### Original (Reasoning Off)

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) | 92.6556 | 0.711437 |
| GPQA | 43.2 | 2.04831 |
| HumanEval | 85.6 | 0.37238 |
| MGSM | 90.9091 | 1.40836 |
| MMLU | 84.6 | 0.6 |

#### Quantized (Reasoning Off)

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) | 91.2381 | 0.843284 |
| GPQA | 43.2 | 0.997775 |
| HumanEval | 85.08 | 0.430194 |
| MGSM | 92.9091 | 0.994013 |
| MMLU | 82.8 | 1.04137 |

I.e., all quantized evals are within the statistical error of the original model's evals.

#### Quantized (Reasoning On)

For completeness, here are also the results with **Reasoning On**:

| Benchmark | Average Score | Standard Error |
|-----------|--------------:|---------------:|
| DROP (F1) | 89.8326 | 1.14615 |
| GPQA | 61.2 | 1.81842 |
| HumanEval | 93 | 0.181353 |
| MGSM | 94.9091 | 0.931048 |
| MMLU | 85.2 | 0.8 |