Ithanil/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic

+---
+license: other
+license_name: nvidia-open-model-license
+license_link: >-
+  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
+base_model:
+- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
+---
+# Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic
+SmoothQuant/GPTQ W8A8 quantization of https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
+## Creation
+Created with llmcompressor using the following code:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from datasets import load_dataset
+from llmcompressor import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
+import random
+# Config
+MODEL_ID = "/models/Llama-3_1-Nemotron-Ultra-253B-v1"
+SAVE_DIR = "/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic"
+NUM_CALIBRATION_SAMPLES = 1024
+MAX_SEQUENCE_LENGTH = 4096
+# Load model
+device_map = calculate_offload_device_map(
+    MODEL_ID, num_gpus=8, reserve_for_hessians=False, torch_dtype="auto", trust_remote_code=True,
+)
+print(device_map)
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_ID, device_map=device_map, torch_dtype="auto", trust_remote_code=True,
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=1337).select(range(NUM_CALIBRATION_SAMPLES))
+def add_system_prompt(messages):
+    options = ["on", "off"]
+    thinking = random.choice(options)
+    return [{"content": f"detailed thinking {thinking}", "role": "system"}] + messages
+def preprocess(example):
+    return {"text": tokenizer.apply_chat_template(add_system_prompt(example["messages"]), tokenize=False)}
+ds = ds.map(preprocess)
+def tokenize(sample):
+    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+# Configure the quantization algorithms
+recipe = [
+    SmoothQuantModifier(smoothing_strength=0.8),
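+    GPTQModifier(
+        targets="Linear",
+        scheme="W8A8",
+        # Skip lm_head and the four layers whose GPTQ Hessians were too large (see the note below)
+        ignore=["lm_head", "re:.*125.*", "re:.*134.*", "re:.*143.*", "re:.*149.*"],
+        dampening_frac=0.01,
+        offload_hessians=False,
+    ),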
+]
+# Apply quantization
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=recipe,
+    max_seq_length=MAX_SEQUENCE_LENGTH,
+    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+    trust_remote_code_model=True
+)
+# Save the compressed model
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```
+**Note** that layers 125, 134, 143 and 149 had to be **excluded** from GPTQ quantization: they are so large that GPTQ would have allocated Hessian matrices of 600+ GB, and offloading those Hessians did not work (the cause was not determined).
+Furthermore, the GPU memory allocation code in `calculate_offload_device_map()` was adjusted.
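
To give a feel for why a single layer can demand hundreds of gigabytes, the following is a minimal sketch of the arithmetic, not code from the quantization run. It assumes that GPTQ keeps one dense square Hessian per `Linear` layer whose side length equals the layer's input width, and that the matrix is stored in float32 (4 bytes per element); both are assumptions about the implementation. The helper names `gptq_hessian_bytes` and `max_in_features` are hypothetical, and the width 16384 is only an illustrative value.

```python
import math


def gptq_hessian_bytes(in_features: int, bytes_per_element: int = 4) -> int:
    """Bytes needed for one dense (in_features x in_features) Hessian."""
    return in_features * in_features * bytes_per_element


def max_in_features(budget_bytes: float, bytes_per_element: int = 4) -> int:
    """Largest input width whose Hessian still fits into budget_bytes."""
    return math.isqrt(int(budget_bytes) // bytes_per_element)


# Illustrative width only: a 16384-wide input needs exactly 1 GiB in float32.
print(gptq_hessian_bytes(16_384) / 2**30)

# Input width at which a float32 Hessian reaches 600 GB (decimal gigabytes).
print(max_in_features(600e9))
```

Because the cost grows with the square of the input width, a layer only a few times wider than its neighbours can need orders of magnitude more memory for its Hessian.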
+## Evaluation
+### GSM8K (3 Runs)
+#### Original
+|Tasks        |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k (run 1)|      3|flexible-extract|     5|exact_match|↑  |0.9469|±  |0.0062|
+|             |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|
+|gsm8k (run 2)|      3|flexible-extract|     5|exact_match|↑  |0.9424|±  |0.0064|
+|             |       |strict-match    |     5|exact_match|↑  |0.9401|±  |0.0065|
+|gsm8k (run 3)|      3|flexible-extract|     5|exact_match|↑  |0.9454|±  |0.0063|
+|             |       |strict-match    |     5|exact_match|↑  |0.9454|±  |0.0063|
+|Average      |      3|flexible-extract|     5|exact_match|↑  |0.9449|±  |0.0036|
+|             |       |strict-match    |     5|exact_match|↑  |0.9439|±  |0.0037|
+#### Quantized
+|Tasks        |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
+|-------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
+|gsm8k (run 1)|      3|flexible-extract|     5|exact_match|↑  |0.9431|±  |0.0064|
+|             |       |strict-match    |     5|exact_match|↑  |0.9393|±  |0.0066|
+|gsm8k (run 2)|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
+|             |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
+|gsm8k (run 3)|      3|flexible-extract|     5|exact_match|↑  |0.9477|±  |0.0061|
+|             |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|
+|Average      |      3|flexible-extract|     5|exact_match|↑  |0.9482|±  |0.0035|
+|             |       |strict-match    |     5|exact_match|↑  |0.9452|±  |0.0036|
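
The averaged rows in both GSM8K tables are consistent with taking the plain mean of the three runs and combining their standard errors in quadrature. This is inferred from the reported numbers rather than stated by the author, and it treats the runs as independent. A minimal sketch (the helper name `combine_runs` is hypothetical):

```python
import math


def combine_runs(values, stderrs):
    """Mean of independent runs and the standard error of that mean."""
    n = len(values)
    mean = sum(values) / n
    # Independent errors add in quadrature; dividing by n gives the error of the mean.
    stderr = math.sqrt(sum(s * s for s in stderrs)) / n
    return mean, stderr


# flexible-extract values and standard errors copied from the tables above
original_flexible = combine_runs([0.9469, 0.9424, 0.9454], [0.0062, 0.0064, 0.0063])
quantized_flexible = combine_runs([0.9431, 0.9538, 0.9477], [0.0064, 0.0058, 0.0061])

print(f"original:  {original_flexible[0]:.4f} ± {original_flexible[1]:.4f}")
print(f"quantized: {quantized_flexible[0]:.4f} ± {quantized_flexible[1]:.4f}")
```

Rounded to four decimals, this reproduces the flexible-extract averages of both tables.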
+### simple-evals (10x50 Samples each)
+Run with a custom fork of OpenAI's simple-evals benchmark suite: https://github.com/Ithanil/simple-evals/tree/custom
+The evaluations used the chat template and NVIDIA's suggested settings:
+- Reasoning Off: Greedy (`temperature=0`), system prompt: `detailed thinking off`
+- Reasoning On: `temperature=0.6`, `top_p=0.95`, system prompt: `detailed thinking on`
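
As an illustration of the two settings above, the sketch below assembles a request payload for either mode. The helper `build_chat_request` is hypothetical and not part of simple-evals; the field names (`messages`, `temperature`, `top_p`) assume an OpenAI-style chat-completions endpoint, which may not match the serving stack actually used.

```python
def build_chat_request(user_prompt: str, reasoning: bool) -> dict:
    """Build a chat request that uses the settings listed above (hypothetical helper)."""
    mode = "on" if reasoning else "off"
    request = {
        "messages": [
            # The system prompt toggles the model's reasoning mode.
            {"role": "system", "content": f"detailed thinking {mode}"},
            {"role": "user", "content": user_prompt},
        ],
    }
    if reasoning:
        request.update(temperature=0.6, top_p=0.95)
    else:
        # Greedy decoding
        request.update(temperature=0.0)
    return request


print(build_chat_request("What is 17 * 23?", reasoning=False))
```
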
+#### Original (Reasoning Off)
+| Benchmark   |   Average Score |   Standard Error |
+|-------------|-----------------|------------------|
+| DROP (F1)   |         92.6556 |         0.711437 |
+| GPQA        |         43.2    |         2.04831  |
+| HumanEval   |         85.6    |         0.37238  |
+| MGSM        |         90.9091 |         1.40836  |
+| MMLU        |         84.6    |         0.6      |
+#### Quantized (Reasoning Off)
+| Benchmark   |   Average Score |   Standard Error |
+|-------------|-----------------|------------------|
+| DROP (F1)   |         91.2381 |         0.843284 |
+| GPQA        |         43.2    |         0.997775 |
+| HumanEval   |         85.08   |         0.430194 |
+| MGSM        |         92.9091 |         0.994013 |
+| MMLU        |         82.8    |         1.04137  |
+
+That is, each quantized score lies within two combined standard errors of the original model's score, so these evals show no statistically significant degradation.
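
The comparison of the two Reasoning Off tables can be checked numerically. The sketch below divides each score difference by the combined standard error of the two measurements, assuming the original and quantized evaluations are independent. The scores and standard errors are copied from the tables above; the helper name `z_score` is hypothetical.

```python
import math

# (average score, standard error) from the two "Reasoning Off" tables
original = {
    "DROP (F1)": (92.6556, 0.711437),
    "GPQA": (43.2, 2.04831),
    "HumanEval": (85.6, 0.37238),
    "MGSM": (90.9091, 1.40836),
    "MMLU": (84.6, 0.6),
}
quantized = {
    "DROP (F1)": (91.2381, 0.843284),
    "GPQA": (43.2, 0.997775),
    "HumanEval": (85.08, 0.430194),
    "MGSM": (92.9091, 0.994013),
    "MMLU": (82.8, 1.04137),
}


def z_score(reference, candidate):
    """Score difference in units of the combined standard error."""
    (ref_score, ref_err), (cand_score, cand_err) = reference, candidate
    return (cand_score - ref_score) / math.hypot(ref_err, cand_err)


z_scores = {name: z_score(original[name], quantized[name]) for name in original}
for name, z in z_scores.items():
    print(f"{name:10s} {z:+.2f}")
```

A magnitude below about 2 is conventionally read as no significant difference, and every benchmark here stays under that threshold.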
+#### Quantized (Reasoning On)
+For completeness, here are the results with **reasoning on**:
+| Benchmark   |   Average Score |   Standard Error |
+|-------------|-----------------|------------------|
+| DROP (F1)   |         89.8326 |         1.14615  |
+| GPQA        |         61.2    |         1.81842  |
+| HumanEval   |         93      |         0.181353 |
+| MGSM        |         94.9091 |         0.931048 |
+| MMLU        |         85.2    |         0.8      |