---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
base_model:
- nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
---

# Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic

SmoothQuant/GPTQ W8A8 (8-bit weights, 8-bit activations) quantization of [nvidia/Llama-3_1-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1).

## Creation

Created with llmcompressor using the following code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
import random

# Config
MODEL_ID = "/models/Llama-3_1-Nemotron-Ultra-253B-v1"
SAVE_DIR = "/models/Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic"
NUM_CALIBRATION_SAMPLES = 1024
MAX_SEQUENCE_LENGTH = 4096

# Load model
device_map = calculate_offload_device_map(
    MODEL_ID, num_gpus=8, reserve_for_hessians=False, torch_dtype="auto", trust_remote_code=True,
)
print(device_map)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype="auto", trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=1337).select(range(NUM_CALIBRATION_SAMPLES))

def add_system_prompt(messages):
    options = ["on", "off"]
    thinking = random.choice(options)
    return [{"content": f"detailed thinking {thinking}", "role": "system"}] + messages

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(add_system_prompt(example["messages"]), tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*125.*", "re:.*134.*", "re:.*143.*", "re:.*149.*"],
        dampening_frac=0.01,
        offload_hessians=False,
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True
)

# Save the compressed model
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

**Note:** Layers 125, 134, 143 and 149 had to be **excluded** from GPTQ quantization because their extreme size would require allocating Hessian matrices of 600+ GB, which could not be offloaded for reasons that remain unclear.
Furthermore, the GPU memory allocation code in `calculate_offload_device_map()` was adjusted.
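
To give a feel for why very wide layers are a problem, here is a minimal sketch of the memory arithmetic. It assumes that GPTQ keeps one square float32 Hessian per `Linear` layer whose side equals the layer's input width; that assumption, the helper names, and the example widths are illustrative and are not taken from the model config or from llmcompressor itself.

```python
import math


def gptq_hessian_bytes(in_features: int, bytes_per_element: int = 4) -> int:
    """Bytes needed for one square (in_features x in_features) Hessian.

    Assumption: float32 storage (4 bytes per element).
    """
    return in_features * in_features * bytes_per_element


def max_in_features(budget_bytes: int, bytes_per_element: int = 4) -> int:
    """Largest input width whose Hessian still fits into budget_bytes."""
    return math.isqrt(budget_bytes // bytes_per_element)


# Hypothetical input widths, chosen only to show the quadratic growth.
for width in (16_384, 131_072, 393_216):
    gib = gptq_hessian_bytes(width) / 2**30
    print(f"in_features={width:>7}: {gib:,.0f} GiB")
```

Because the cost grows with the square of the input width, a layer that is only a few times wider than its neighbours can need an order of magnitude more memory for its Hessian.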

## Evaluation

### GSM8K (3 Runs)

#### Original
|Run |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|----|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|1   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9469|±  |0.0062|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|
|2   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9424|±  |0.0064|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9401|±  |0.0065|
|3   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9454|±  |0.0063|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9454|±  |0.0063|
|Avg.|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9449|±  |0.0036|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9439|±  |0.0037|

#### Quantized
|Run |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|----|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|1   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9431|±  |0.0064|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9393|±  |0.0066|
|2   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
|3   |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9477|±  |0.0061|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|
|Avg.|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9482|±  |0.0035|
|    |     |       |strict-match    |     5|exact_match|↑  |0.9452|±  |0.0036|
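
The averaged rows in both GSM8K tables are consistent with taking the plain mean of the three run values and propagating the per-run standard errors as if the runs were independent. This is inferred from the numbers and not stated by the author; the sketch below (with a hypothetical helper name `combine_runs`) reproduces the flexible-extract averages under that assumption.

```python
import math


def combine_runs(values, stderrs):
    """Mean of several runs and the standard error of that mean.

    Assumption: runs are treated as independent, so the standard error
    of the mean is sqrt(sum of squared stderrs) / number of runs.
    """
    n = len(values)
    mean = sum(values) / n
    stderr = math.sqrt(sum(s * s for s in stderrs)) / n
    return mean, stderr


# flexible-extract values and standard errors copied from the tables above
original = combine_runs([0.9469, 0.9424, 0.9454], [0.0062, 0.0064, 0.0063])
quantized = combine_runs([0.9431, 0.9538, 0.9477], [0.0064, 0.0058, 0.0061])

print(f"original : {original[0]:.4f} ± {original[1]:.4f}")
print(f"quantized: {quantized[0]:.4f} ± {quantized[1]:.4f}")
```

The same function applied to the strict-match rows gives the remaining two averaged entries.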

### simple-evals (10x50 Samples each)

Using a custom fork of OpenAI's simple-evals benchmark suite: https://github.com/Ithanil/simple-evals/tree/custom

These were run using the chat template and NVIDIA's suggested settings:
- Reasoning Off: Greedy (`temperature=0`), system prompt: `detailed thinking off`
- Reasoning On: `temperature=0.6`, `top_p=0.95`, system prompt: `detailed thinking on`
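
As a rough illustration of the two settings above, the sketch below builds a chat request body for each mode. It assumes the model is served behind an OpenAI-style chat completions endpoint; the payload layout, the `build_chat_request` helper and the default model name are assumptions for illustration and do not describe how the simple-evals fork issues its requests.

```python
# Sampling parameters per reasoning mode, taken from the list above.
SAMPLING = {
    "off": {"temperature": 0.0},                # greedy decoding
    "on": {"temperature": 0.6, "top_p": 0.95},
}


def build_chat_request(
    user_prompt: str,
    reasoning: str,
    model: str = "Llama-3_1-Nemotron-Ultra-253B-v1-W8A8-Dynamic",  # placeholder name
) -> dict:
    """Return a request body with the system prompt and sampling settings."""
    if reasoning not in SAMPLING:
        raise ValueError("reasoning must be 'on' or 'off'")
    return {
        "model": model,
        "messages": [
            # The system prompt toggles the reasoning mode.
            {"role": "system", "content": f"detailed thinking {reasoning}"},
            {"role": "user", "content": user_prompt},
        ],
        **SAMPLING[reasoning],
    }


request_off = build_chat_request("What is 17 * 23?", reasoning="off")
request_on = build_chat_request("What is 17 * 23?", reasoning="on")
```

Keeping the system prompt and the sampling parameters in one place makes it harder to pair, for example, `detailed thinking on` with greedy decoding by accident.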

#### Original (Reasoning Off)

| Benchmark   |   Average Score |   Standard Error |
|-------------|-----------------|------------------|
| DROP (F1)   |         92.6556 |         0.711437 |
| GPQA        |         43.2    |         2.04831  |
| HumanEval   |         85.6    |         0.37238  |
| MGSM        |         90.9091 |         1.40836  |
| MMLU        |         84.6    |         0.6      |

#### Quantized (Reasoning Off)

| Benchmark   |   Average Score |   Standard Error |
|-------------|-----------------|------------------|
| DROP (F1)   |         91.2381 |         0.843284 |
| GPQA        |         43.2    |         0.997775 |
| HumanEval   |         85.08   |         0.430194 |
| MGSM        |         92.9091 |         0.994013 |
| MMLU        |         82.8    |         1.04137  |

That is, every quantized score lies within two combined standard errors of the original model's score.
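
One way to check this comparison is to express each gap in units of the combined standard error of the two measurements. The sketch below does that for the Reasoning Off tables; treating the two standard errors as independent and combining them in quadrature is an assumption made here, and the helper name is hypothetical.

```python
import math

# (average score, standard error) copied from the Reasoning Off tables above
original = {
    "DROP (F1)": (92.6556, 0.711437),
    "GPQA": (43.2, 2.04831),
    "HumanEval": (85.6, 0.37238),
    "MGSM": (90.9091, 1.40836),
    "MMLU": (84.6, 0.6),
}
quantized = {
    "DROP (F1)": (91.2381, 0.843284),
    "GPQA": (43.2, 0.997775),
    "HumanEval": (85.08, 0.430194),
    "MGSM": (92.9091, 0.994013),
    "MMLU": (82.8, 1.04137),
}


def gap_in_standard_errors(a, b):
    """Absolute score difference divided by the combined standard error."""
    (score_a, se_a), (score_b, se_b) = a, b
    return abs(score_a - score_b) / math.hypot(se_a, se_b)


gaps = {
    name: gap_in_standard_errors(original[name], quantized[name])
    for name in original
}
for name, gap in gaps.items():
    print(f"{name:<10} {gap:.2f}")
```

With these numbers the largest gap is MMLU at roughly 1.5 combined standard errors, and every benchmark stays below 2.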

#### Quantized (Reasoning On)
For completeness, here are the results with **Reasoning On**:

| Benchmark   |   Average Score |   Standard Error |
|-------------|-----------------|------------------|
| DROP (F1)   |         89.8326 |         1.14615  |
| GPQA        |         61.2    |         1.81842  |
| HumanEval   |         93      |         0.181353 |
| MGSM        |         94.9091 |         0.931048 |
| MMLU        |         85.2    |         0.8      |