Update README.md
@@ -18,12 +18,7 @@ The hybrid quant employs different quantization levels on a per layer basis to i
 flexibility of trading off performance vs file size. Fewer parameter bits are used at deep layers
 and more bits at cortex layers to simultaneously optimize quantized size and model performance.
 This quant was designed to match IQ4_XS size and perform better than IQ4_XS while using all K-quants for faster CPU
-processing.
-This MoE model can be efficiently run by offloading expert tensors to CPU via -ot exps=CPU
-to open up very large context space. The smaller size of the optimally quantized parameters will give
-an effective boost in CPU processing speed due to reducing the memory BW needed to repeatedly copy them
-from main memory to SIMD regs. It can also run fully offloaded on GPU via RPC or a high-VRAM GPU. For
-this file the layer quants are as follows:
+processing. For this file the layer quants are as follows:
 ```
 LAYER_TYPES='[
 [0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
@@ -44,6 +39,17 @@ Quant | size | PPL | Comment
 IQ4_XS | 16.6e9 | 9.15 | default embed and output
 Q4_K_H | 16.6e9 | 9.10 | Q4_K embed Q6_K output
 
+Usage:
+
+This MoE model can be efficiently run by offloading expert tensors to CPU via -ot exps=CPU
+to open up very large context space. The smaller size of the optimally quantized parameters will give
+an effective boost in CPU processing speed due to reducing the memory BW needed to repeatedly copy them
+from main memory to SIMD regs. It can also run fully offloaded on GPU via RPC or a high-VRAM GPU.
+
+Benchmarks:
+
+Partial evals for the model are given here: https://huggingface.co/spaces/steampunque/benchlm.
+
 ## Download the file from below:
 | Link | Type | Size/e9 B | Notes |
 |------|------|-----------|-------|
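The memory-bandwidth argument in the added Usage text can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the bits-per-weight figures are approximate values for llama.cpp quant types, not numbers from this model card, and the active-parameter count and bandwidth are made-up assumptions.

```python
# Rough, illustrative estimate of why smaller quants speed up CPU decode:
# token generation on CPU is typically memory-bandwidth bound, so tokens/s
# scales inversely with the bytes streamed from main memory per token.

# Approximate bits per weight for some llama.cpp quant types (assumed values).
BPW = {"Q3_K": 3.4375, "IQ4_XS": 4.25, "Q4_K": 4.5}

def bytes_per_token(active_params: float, bpw: float) -> float:
    """Bytes read from main memory per generated token."""
    return active_params * bpw / 8.0

def tokens_per_sec(active_params: float, bw_bytes: float, bpw: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the bottleneck."""
    return bw_bytes / bytes_per_token(active_params, bpw)

if __name__ == "__main__":
    active = 3e9   # hypothetical active parameters per token for the MoE
    bw = 50e9      # hypothetical 50 GB/s of usable memory bandwidth
    for q in ("IQ4_XS", "Q3_K"):
        print(q, round(tokens_per_sec(active, bw, BPW[q]), 1))
    # In this bandwidth-bound model the speedup is just the bpw ratio:
    print("speedup", round(BPW["IQ4_XS"] / BPW["Q3_K"], 3))
```

Under these assumptions, dropping the deep layers from roughly IQ4_XS-level to Q3_K-level bit widths buys about a 1.24x decode-speed ceiling, which is the kind of "effective boost in CPU processing speed" the text describes.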