---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct GGUF DASLab Quantization
This repository contains advanced quantized versions of Llama 3.1 8B Instruct using **GPTQ quantization** and **GPTQ+EvoPress optimization** from the [DASLab GGUF Toolkit](https://github.com/IST-DASLab/gguf-toolkit).
## Models
- **GPTQ Uniform**: High-quality GPTQ quantization at 2-6 bit precision
- **GPTQ+EvoPress**: Non-uniform per-layer quantization discovered via evolutionary search
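
The EvoPress idea in brief: treat the per-layer bit allocation as a search problem, mutating candidate configurations (one bitwidth per layer) and keeping those that improve measured model quality under a fixed average-bit budget. Below is a toy sketch of such a (1+1) evolutionary loop; the `eval_loss` fitness and the mutation scheme are illustrative placeholders, not the toolkit's actual search.

```python
import random

BITS = [2, 3, 4, 5, 6]   # candidate bitwidths per layer
N_LAYERS = 32            # e.g. the decoder blocks of Llama 3.1 8B
BUDGET = 4.5             # target average bits per weight

def eval_loss(config):
    """Placeholder fitness: a stand-in for the perplexity you would
    measure after quantizing the model with this per-layer config."""
    # Toy sensitivity model: pretend earlier layers are more sensitive.
    return sum((N_LAYERS - i) / (2 ** b) for i, b in enumerate(config))

def mutate(config):
    """Bump one random layer's bitwidth one step up or down."""
    child = list(config)
    i = random.randrange(N_LAYERS)
    j = BITS.index(child[i]) + random.choice([-1, 1])
    child[i] = BITS[max(0, min(len(BITS) - 1, j))]
    return child

def search(generations=2000):
    best = [4] * N_LAYERS                      # start from a uniform config
    for _ in range(generations):
        child = mutate(best)
        if sum(child) / N_LAYERS <= BUDGET:    # respect the bit budget
            if eval_loss(child) < eval_loss(best):
                best = child                   # greedy (1+1) selection
    return best

print(search())
```

In the real setting the fitness is the quantized model's perplexity on calibration data, which is what makes the discovered configurations non-uniform: sensitive layers end up with more bits.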
## Performance
Our GPTQ-based quantization methods achieve **superior quality-compression tradeoffs** compared to standard quantization:
- **Better perplexity** at equivalent bitwidths vs. naive quantization approaches
- **Error-correcting updates** during calibration for improved accuracy (see the sketch after this list)
- **Optimized configurations** that allocate bits based on layer sensitivity (EvoPress)
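
The error-correcting update is the core of GPTQ: weights are quantized one column at a time, and each column's quantization error is folded back into the not-yet-quantized columns using second-order (Hessian) statistics from calibration data. A minimal NumPy sketch of this mechanism, heavily simplified relative to the toolkit's actual implementation:

```python
import numpy as np

def quantize_rtn(col, bits):
    """Round-to-nearest quantization of one column to a symmetric grid."""
    levels = 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(col)), 1e-12) / levels
    return np.round(col / scale) * scale

def gptq_sketch(W, X, bits=4):
    """Column-by-column quantization with error feedback.

    W: (rows, cols) weight matrix; X: (cols, samples) calibration inputs.
    The Hessian of the layer-wise reconstruction loss is H = X @ X.T."""
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    Hinv = np.linalg.inv(X @ X.T + 1e-2 * np.eye(cols))  # damped inverse Hessian
    Q = np.zeros_like(W)
    for j in range(cols):
        Q[:, j] = quantize_rtn(W[:, j], bits)
        err = (W[:, j] - Q[:, j]) / Hinv[j, j]            # scaled quantization error
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])    # compensate later columns
    return Q
```

This per-column compensation is why GPTQ beats plain round-to-nearest at the same bitwidth; EvoPress then decides, per layer, which bitwidth the procedure runs at.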
| Method          | Avg Bits | C4 PPL | WikiText2 PPL |
|-----------------|----------|--------|---------------|
| GPTQ-4          | 4.50     | 11.35  | 6.89          |
| EvoPress-GPTQ-4 | 4.50     | 11.35  | 6.89          |
| EvoPress-GPTQ-5 | 5.51     | 11.13  | 6.79          |
## Usage
Compatible with llama.cpp and all GGUF-supporting inference engines. No special setup required.
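
For example, with the `llama-cpp-python` bindings (the model path below is a placeholder; substitute one of the GGUF files actually published in this repository):

```python
from llama_cpp import Llama

# Path is a placeholder for a GGUF file downloaded from this repo.
llm = Llama(model_path="Llama-3.1-8B-Instruct-gptq-4bit.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GPTQ in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```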
**Full documentation, evaluation results, and toolkit source**: https://github.com/IST-DASLab/gguf-toolkit
---