metadata
library_name: gguf
tags:
- llama
- quantized
- gptq
- evopress
model_type: llama
base_model: meta-llama/Llama-3.1-8B-Instruct
Llama-3.1-8B-Instruct GGUF DASLab Quantization
This repository contains advanced quantized versions of Llama 3.1 8B Instruct using GPTQ quantization and GPTQ+EvoPress optimization from the DASLab GGUF Toolkit.
Models
- GPTQ Uniform: High-quality GPTQ quantization at 2-6 bit precision
- GPTQ+EvoPress: Non-uniform per-layer quantization discovered via evolutionary search
Performance
Our GPTQ-based quantization methods achieve superior quality-compression tradeoffs compared to standard quantization:
- Better perplexity at equivalent bitwidths vs. naive quantization approaches
- Error-correcting updates during calibration for improved accuracy
- Optimized configurations that allocate bits based on layer sensitivity (EvoPress)
Usage
Compatible with llama.cpp and all GGUF-supporting inference engines. No special setup required.
Full documentation, evaluation results, and toolkit source: https://github.com/IST-DASLab/gguf-toolkit