steampunque committed
Commit d7a0f37 · verified · 1 Parent(s): 4be0dfb

Update README.md

Files changed (1): README.md (+12 -6)
README.md CHANGED
@@ -18,12 +18,7 @@ The hybrid quant employs different quantization levels on a per layer basis to i
  flexibility of trading off performance vs file size. Fewer parameter bits are used at deep layers
  and more bits at cortex layers to simultaneously optimize quantized size and model performance.
  This quant was designed to match IQ4_XS size and perform better than IQ4_XS while using all K-quants for faster CPU
- processing. Partial evals for the model are given here: https://huggingface.co/spaces/steampunque/benchlm.
- This MoE model can be efficiently run by offloading expert tensors to CPU via -ot exps=CPU
- to open up very large context space. The smaller size of the optimally quantized parameters will give
- an effective boost in CPU processing speed due to reducing the memory BW needed to repeatedly copy them
- from main memory to SIMD regs. It can also run fully offloaded on GPU via RPC or on a high-VRAM GPU. For
- this file the layer quants are as follows:
+ processing. For this file the layer quants are as follows:
  ```
  LAYER_TYPES='[
  [0 ,"Q3_K_M"],[1 ,"Q3_K_M"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"],
@@ -44,6 +39,17 @@ Quant | size | PPL | Comment
  IQ4_XS | 16.6e9 | 9.15 | default embed and output
  Q4_K_H | 16.6e9 | 9.10 | Q4_K embed Q6_K output
 
+ Usage:
+
+ This MoE model can be efficiently run by offloading expert tensors to CPU via -ot exps=CPU
+ to open up very large context space. The smaller size of the optimally quantized parameters will give
+ an effective boost in CPU processing speed due to reducing the memory BW needed to repeatedly copy them
+ from main memory to SIMD regs. It can also run fully offloaded on GPU via RPC or on a high-VRAM GPU.
+
+ Benchmarks:
+
+ Partial evals for the model are given here: https://huggingface.co/spaces/steampunque/benchlm.
+
  ## Download the file from below:
  | Link | Type | Size/e9 B | Notes |
  |------|------|-----------|-------|
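For readers who want to build a similar layer-wise mix themselves, a rough sketch follows. It is an assumption-laden illustration, not the author's tooling (which this commit does not show): it assumes a recent llama.cpp llama-quantize with --tensor-type PATTERN=TYPE overrides, and that such patterns can address per-layer tensor names. Stock overrides take a single ggml tensor type per pattern, so the per-layer "Q3_K_M"-style mixes in the LAYER_TYPES map above are only approximated here.

```
# Rough sketch only: the author's actual quantization tooling is not shown in
# this commit. Assumes a llama-quantize build with --tensor-type PATTERN=TYPE
# overrides and that PATTERN can target per-layer names like blk.0.* (check
# your build before relying on this).
llama-quantize \
  --tensor-type 'blk.0.=q3_k' \
  --tensor-type 'blk.1.=q3_k' \
  model-f16.gguf model-Q4_K_H.gguf Q4_K_M  # one override per layer; filenames are placeholders
```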
 
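The Usage paragraph added in this commit describes llama.cpp's tensor-override offload; below is a minimal sketch of the corresponding invocation, assuming llama-server (the GGUF filename and context size are placeholders):

```
# Minimal sketch of the expert-offload usage described in the diff.
# -ngl 99 offloads all layers to GPU; -ot exps=CPU (short for --override-tensor)
# then keeps the matching ffn_*_exps expert tensors in system RAM, freeing VRAM
# for a large context window. Filename and -c value are placeholders.
llama-server -m model-Q4_K_H.gguf -ngl 99 -ot exps=CPU -c 65536
```

The bandwidth claim can be sanity-checked with a back-of-envelope estimate using illustrative numbers (assumptions, not measurements): CPU token generation is roughly memory-bound, so tokens/s is capped near DRAM bandwidth divided by the bytes of active weights streamed per token. With about 3e9 active parameters per token and 60 GB/s of DRAM bandwidth, roughly 4.3 bits/weight streams about 1.6 GB/token for a ceiling near 37 tokens/s, versus about 3 GB/token and 20 tokens/s at 8 bits/weight.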
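The PPL column in the comparison table comes from perplexity runs; a hedged sketch of reproducing such a measurement with llama.cpp's llama-perplexity (the evaluation corpus behind the table's numbers is not stated in this commit, so the test file here is an assumption):

```
# Hedged sketch: measuring perplexity with llama.cpp's llama-perplexity tool.
# The corpus used for the table above is not stated in this commit; wiki.test.raw
# (WikiText-2 test split) is a common choice and stands in as an assumption.
llama-perplexity -m model-Q4_K_H.gguf -f wiki.test.raw
```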