---
library_name: gguf
tags:
- llama
- quantized
- gptq
- evopress
model_type: llama
base_model: meta-llama/Llama-3.1-8B-Instruct
---
|
|
|
# Llama-3.1-8B-Instruct GGUF (DASLab Quantization)
|
|
|
This repository contains quantized versions of Llama 3.1 8B Instruct produced with **GPTQ quantization** and **GPTQ+EvoPress optimization** from the [DASLab GGUF Toolkit](https://github.com/IST-DASLab/gguf-toolkit).
|
|
|
## Models |
|
|
|
- **GPTQ Uniform**: GPTQ quantization at a single bitwidth (2-6 bits), applied uniformly across all layers
|
- **GPTQ+EvoPress**: Non-uniform per-layer quantization discovered via evolutionary search |
|
|
|
## Performance |
|
|
|
Our GPTQ-based quantization methods achieve **better quality-compression tradeoffs** than standard GGUF quantization:
|
|
|
- **Lower perplexity** at equivalent bitwidths than naive quantization approaches
|
- **Error-correcting weight updates** during calibration that compensate for quantization error
|
- **Optimized configurations** that allocate bits based on layer sensitivity (EvoPress), as sketched below
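
To make the non-uniform allocation concrete, here is a purely hypothetical sketch: the layer names and bitwidths below are illustrative inventions, not the shipped configuration, which lives in the model files themselves.

```python
# Hypothetical per-layer bitwidth assignment of the kind an EvoPress-style
# search might produce (illustrative only; not this repository's actual config).
# The idea: layers the search found sensitive keep more bits, while layers
# that tolerate compression are pushed to lower precision.
evopress_config = {
    "blk.0.attn": 4,
    "blk.0.ffn": 3,
    "blk.15.attn": 3,
    "blk.15.ffn": 2,
    "blk.31.attn": 4,
    "blk.31.ffn": 3,
}

# Average bits per weight, assuming (hypothetically) equally sized layers.
avg_bits = sum(evopress_config.values()) / len(evopress_config)
print(f"average bitwidth: {avg_bits:.2f}")
```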
|
|
|
## Usage |
|
|
|
The files are standard GGUF and work with llama.cpp and any other GGUF-compatible inference engine; no special setup is required. A minimal loading sketch follows.
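
As a minimal sketch, the models can be run locally with `llama-cpp-python` and `huggingface_hub`. The `repo_id` and `filename` below are placeholders; substitute the actual repository id and a `.gguf` file from its file list.

```python
# Minimal sketch: download one of the GGUF files and run it with
# llama-cpp-python (Python bindings for llama.cpp).
# NOTE: repo_id and filename are placeholders, not this repo's real values.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="IST-DASLab/Llama-3.1-8B-Instruct-GGUF",  # placeholder repo id
    filename="llama-3.1-8b-instruct-gptq.gguf",       # placeholder filename
)

# Load the quantized model; n_ctx sets the context window size.
llm = Llama(model_path=model_path, n_ctx=4096)

out = llm("Explain GPTQ quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Because the files are plain GGUF, the same `.gguf` path also works directly with the `llama-cli` binary shipped with llama.cpp.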
|
|
|
**Full documentation, evaluation results, and toolkit source**: https://github.com/IST-DASLab/gguf-toolkit |
|
|
|