San Diego State University - James Silberrad Brown Center for Artificial Intelligence

GPTQ 4-bit Quantized Version of R1 1776 Distilled to Llama 3.3 70B by Perplexity AI

This model is optimized for vLLM. Benchmarks comparing the quantized version to the BF16 original are forthcoming.

Model Overview

This repository hosts a 4-bit GPTQ-quantized version of perplexity-ai/r1-1776-distill-llama-70b, a distilled 70B parameter LLaMA-based model. The quantization was performed using LLM Compressor, enabling significant memory and compute efficiency gains with minimal degradation in model performance.

This model is ideal for high-throughput inference scenarios and is fully compatible with vLLM.


Quantization Details

  • Quantization Method: GPTQ (post-training weight quantization, applied group-wise)
  • Precision: 4-bit weights, 16-bit activations (W4A16)
  • Group Size: 128
  • Quantized Modules: All nn.Linear layers (excluding lm_head)
  • Calibration Samples: 64
  • Sequence Length Used for Calibration: 1024 tokens
  • Calibration Data Source: OpenWebText
  • Quantization Tool: LLM Compressor v0.5+ (see the sketch after this list)
  • Hardware Used: 2× A100 80GB GPUs
  • Memory Footprint (Inference): ~35–40 GB GPU memory
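
The snippet below is a minimal sketch of how a W4A16 GPTQ export with these settings can be produced using LLM Compressor's one-shot flow. The exact import paths can vary between llm-compressor versions, and the OpenWebText dataset id (Skylion007/openwebtext) and its "text" column are assumptions; treat this as an illustration rather than the exact script used for this repository.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "perplexity-ai/r1-1776-distill-llama-70b"
NUM_CALIBRATION_SAMPLES = 64
MAX_SEQUENCE_LENGTH = 1024

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 64 general-purpose calibration documents (OpenWebText; dataset id assumed)
ds = load_dataset("Skylion007/openwebtext", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.map(
    lambda sample: tokenizer(sample["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True),
    remove_columns=ds.column_names,
)

# W4A16: 4-bit group-quantized weights (group size 128), 16-bit activations,
# applied to every nn.Linear layer except lm_head
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("r1-1776-distill-llama-70b-GPTQ-4bit", save_compressed=True)
tokenizer.save_pretrained("r1-1776-distill-llama-70b-GPTQ-4bit")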

Intended Use

  • Inference with vLLM
  • Memory-efficient chatbot/completion serving
  • Model experimentation at reduced cost

This model is especially suited for developers and researchers looking to deploy LLaMA 70B on a single high-memory GPU (e.g., A100, H100, MI300X), trading int4 weight compression for lower memory use and higher throughput while retaining high generation quality.
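
For single-GPU serving the vLLM defaults usually work, but the sketch below shows the settings that matter for fitting the model in memory; the context length and memory fraction are illustrative values, not tuned recommendations.

from vllm import LLM

llm = LLM(
    model="jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit",
    tensor_parallel_size=1,       # the ~35-40 GB weight footprint fits on one 80 GB GPU
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM is allowed to reserve
    max_model_len=8192,           # cap the context length to bound KV-cache memory
)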


Limitations

  • Outputs are not filtered or RLHF-tuned; content safety must be handled by downstream applications.
  • This model was quantized without retraining; very slight accuracy degradation may occur.
  • Calibration dataset is general-purpose; domain-specific tasks may benefit from re-quantization with task-aligned prompts.

Usage Example (vLLM)

vllm serve jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit
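
Once the server is running it exposes an OpenAI-compatible API (by default at http://localhost:8000/v1), so any OpenAI client can talk to it; the snippet below is a minimal sketch using the official openai Python package.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # any key works locally

response = client.chat.completions.create(
    model="jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit",
    messages=[{"role": "user", "content": "Explain GPTQ quantization in simple terms."}],
    max_tokens=256,
)
print(response.choices[0].message.content)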

Or via the Python API:

from vllm import LLM, SamplingParams

llm = LLM(model="jsbaicenter/r1-1776-distill-llama-70b-GPTQ-4bit")
params = SamplingParams(temperature=0.7, max_tokens=256)

# generate() returns a list of RequestOutput objects, one per prompt
outputs = llm.generate("Explain GPTQ quantization in simple terms.", params)
print(outputs[0].outputs[0].text)

Citation

If you use this model or quantization setup in your work, please consider citing this repository.
