San Diego State University - James Silberrad Brown Center for Artificial Intelligence

GPTQ 4-bit Quantized Version of R1 1776 Distilled to Llama 3.3 70B by Perplexity AI

This model is optimized for vLLM. Benchmarks comparing the quantized version to the BF16 original are forthcoming.

Model Overview

This repository hosts a 4-bit GPTQ-quantized version of perplexity-ai/r1-1776-distill-llama-70b, a distilled 70B parameter LLaMA-based model. The quantization was performed using LLM Compressor, enabling significant memory and compute efficiency gains with minimal degradation in model performance.

This model is ideal for high-throughput inference scenarios and is fully compatible with vLLM.


Quantization Details

  • Quantization Method: GPTQ (post-training weight quantization, applied group-wise)
  • Precision: 4-bit weights, 16-bit activations (W4A16)
  • Group Size: 128
  • Quantized Modules: All nn.Linear layers (excluding lm_head)
  • Calibration Samples: 64
  • Sequence Length Used for Calibration: 1024 tokens
  • Calibration Data Source: OpenWebText
  • Quantization Tool: LLM Compressor v0.5+ (see the sketch after this list)
  • Hardware Used: 2× A100 80GB GPUs
  • Memory Footprint (Inference): ~35–40 GB GPU memory
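
The snippet below is a minimal sketch of how a W4A16 GPTQ export with these settings can be produced using LLM Compressor's one-shot flow. The exact import paths can vary between llm-compressor versions, and the OpenWebText dataset id (Skylion007/openwebtext) and its "text" column are assumptions; treat this as an illustration rather than the exact script used for this repository.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "perplexity-ai/r1-1776-distill-llama-70b"
NUM_CALIBRATION_SAMPLES = 64
MAX_SEQUENCE_LENGTH = 1024

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 64 general-purpose calibration documents (OpenWebText; dataset id assumed)
ds = load_dataset("Skylion007/openwebtext", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.map(
    lambda sample: tokenizer(sample["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True),
    remove_columns=ds.column_names,
)

# W4A16: 4-bit group-quantized weights (group size 128), 16-bit activations,
# applied to every nn.Linear layer except lm_head
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("r1-1776-distill-llama-70b-GPTQ-4bit", save_compressed=True)
tokenizer.save_pretrained("r1-1776-distill-llama-70b-GPTQ-4bit")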

Intended Use

  • Inference with vLLM
  • Memory-efficient chatbot/completion serving
  • Model experimentation at reduced cost

This model is especially suited for developers and researchers looking to deploy LLaMA 70B on a single high-memory GPU (e.g., A100, H100, MI300X), trading int4 weight compression for lower memory use and higher throughput while retaining high generation quality.
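
For single-GPU serving the vLLM defaults usually work, but the sketch below shows the settings that matter for fitting the model in memory; the context length and memory fraction are illustrative values, not tuned recommendations.

from vllm import LLM

llm = LLM(
    model="jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit",
    tensor_parallel_size=1,       # the ~35-40 GB weight footprint fits on one 80 GB GPU
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM is allowed to reserve
    max_model_len=8192,           # cap the context length to bound KV-cache memory
)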


Limitations

  • Outputs are not filtered or RLHF-tuned; content safety must be handled by downstream applications.
  • This model was quantized without retraining; very slight accuracy degradation may occur.
  • Calibration dataset is general-purpose; domain-specific tasks may benefit from re-quantization with task-aligned prompts.

Usage Example (vLLM)

vllm serve jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit
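
Once the server is running it exposes an OpenAI-compatible API (by default at http://localhost:8000/v1), so any OpenAI client can talk to it; the snippet below is a minimal sketch using the official openai Python package.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # any key works locally

response = client.chat.completions.create(
    model="jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit",
    messages=[{"role": "user", "content": "Explain GPTQ quantization in simple terms."}],
    max_tokens=256,
)
print(response.choices[0].message.content)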

Or via the Python API:

from vllm import LLM, SamplingParams

llm = LLM(model="jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit")
params = SamplingParams(temperature=0.7, max_tokens=256)

# generate() returns a list of RequestOutput objects, one per prompt
outputs = llm.generate("Explain GPTQ quantization in simple terms.", params)
print(outputs[0].outputs[0].text)

Citation

If you use this model or quantization setup in your work, please consider citing this repository.
