ubergarm committed on
Commit e1ce4cd · 1 Parent(s): d7a44e7

Release DeepSeek-V3-0324-IQ4_K_R4 and benchmarks

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb8ef08b44a99223040cb02d2f89764eb03662669a65c690da670a3770521f57
3
+ size 41169676352
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00002-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7ddbd081cbdad380bb4548c81a2fc43a7f405d306f29678dfa1283b998c0ff3f
3
+ size 42494252256
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00003-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:74789329b86ab85418f361e0e167c627ff94b0c12d27a1acd75823120c6b82e4
3
+ size 42494252288
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00004-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7c2840f878709701a655caca5ee86952293cf00137677065582eed49595491a4
3
+ size 42494252288
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00005-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a07cb7b0c4d8693fce701d08e9ec4cb2e693273279ba39fd17c3a1755439e81c
3
+ size 42494252288
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00006-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:18483856dcc014e7aa32c55b641695ff05095822b86c05c87d901f9d1b3dfee2
3
+ size 42494252288
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00007-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7c925b58c8394d1e965c930e2f6c415b0ea28cefb4bf6c383575f5e27d60c89a
3
+ size 42494252288
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00008-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:520fecd53d32111018cd13c235d5731c737865497560726c4d253804476516ae
3
+ size 42494252288
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00009-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8eba7dada84aad746661978ef4edcd6cf6b12d5a2cb27840d52d49dfeb89d882
3
+ size 42494252288
DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00010-of-00010.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1af85ec57870ca34ea61152fd6ee2697bd8a3265006c8965ce80b12904ab1b46
3
+ size 33542014112
README.md CHANGED
@@ -4,15 +4,174 @@ pipeline_tag: text-generation
4
  base_model: deepseek-ai/DeepSeek-V3-0324
5
  license: mit
6
  base_model_relation: quantized
7
  ---
8
 
9
- ## `ik_llma.cpp` imatrix MLA Quantizations of DeepSeek-V3-0324 by deepseek-ai
10
 
11
- This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do **not** download it and expect it to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
12
 
13
- These quants provide great perplexity for the size. MLA support allows 32k+ (or even 64k+) context length in under 24GB GPU VRAM for `R1` and `V3` while offloading MoE layers to RAM.
14
 
15
- ## imatrix
16
 
17
  <details>
18
 
@@ -59,33 +218,16 @@ Final estimate: PPL = 3.4755 +/- 0.03305
59
 
60
  </details>
61
 
62
- ## Quant Collection
63
-
64
- #### `IQ2_K_R4`
65
- Hybrid `IQ2_K_R4` non-linear quant for 32k context using `q8_0` MLA in for CPU+GPU offload with 96+GB RAM and 24+GB VRAM with minimal perplexity.
66
-
67
- <details>
68
-
69
- <summary>`IQ2_K_R4` Details Here</summary>
70
-
71
- ```bash
72
- $ git branch
73
- * ik/make_qx_quants
74
-
75
- $ git rev-parse --short HEAD
76
- b9c25fe7
77
- ```
78
-
79
- ---
80
-
81
- ## Quantize Script
82
 
83
  ```bash
84
  #!/usr/bin/env bash
85
 
86
  custom="
87
- # Token embedding and output tensors (GPU)
 
88
  token_embd\.weight=q8_0
 
89
  output\.weight=q8_0
90
  output_norm\.weight=q8_0
91
 
@@ -93,6 +235,7 @@ output_norm\.weight=q8_0
93
  blk\.[0-2]\..*=q8_0
94
 
95
  # All attention, weights, and bias tensors for MoE layers (3-60) (GPU)
 
96
  blk\.[3-9]\.attn_.*=q8_0
97
  blk\.[1-5][0-9]\.attn_.*=q8_0
98
  blk\.60\.attn_.*=q8_0
@@ -114,7 +257,8 @@ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0
114
  blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
115
  blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0
116
 
117
- # MoE Experts (3-60) (CPU)
 
118
  blk\.[3-9]\.ffn_down_exps\.weight=iq3_k_r4
119
  blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_k_r4
120
  blk\.60\.ffn_down_exps\.weight=iq3_k_r4
@@ -140,9 +284,7 @@ custom=$(
140
  24
141
  ```
142
 
143
- ---
144
-
145
- ## Perplexity
146
 
147
  ```bash
148
  $ CUDA_VISIBLE_DEVICES="0," \
@@ -559,12 +701,10 @@ llama_print_timings: total time = 2841519.57 ms / 287233 tokens
559
  Final estimate: PPL = 3.5614 +/- 0.02001
560
  ```
561
 
562
- ---
563
-
564
- ## Split
565
 
566
  ```bash
567
- $ ./build/bin/llama-gguf-split
568
  --dry-run \
569
  --split \
570
  --split-max-size 50G \
@@ -574,44 +714,6 @@ $ ./build/bin/llama-gguf-split
574
 
575
  </details>
576
 
577
- #### `TODO`
578
-
579
- - [ ] Upload good CPU *only* optimized inferencing quant
580
-
581
- ## `ik_llama.cpp` API server
582
-
583
- ```bash
584
- # I think temperature "1.0" on the API is 0.3 in llama.cpp ????
585
- # https://api-docs.deepseek.com/quick_start/parameter_settings
586
- # https://www.reddit.com/r/LocalLLaMA/comments/1jk0qjs/178bit_deepseekv30324_230gb_unsloth_dynamic_gguf/
587
-
588
- # Uses just under 24GB VRAM
589
- CUDA_VISIBLE_DEVICES="0," \
590
- ./build/bin/llama-server \
591
- --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
592
- --alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
593
- --ctx-size 32768 \
594
- -ctk q8_0 \
595
- -mla 2 -fa \
596
- -amb 512 \
597
- -fmoe \
598
- --min-p 0.01 \
599
- --temp 0.0 \
600
- --n-gpu-layers 63 \
601
- --override-tensor exps=CPU \
602
- --parallel 1 \
603
- --threads 16 \
604
- --host 127.0.0.1 \
605
- --port 8080
606
- ```
607
-
608
- ## Big Thanks
609
- Big thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for sharing tips and tricks to help each other access all the fun new models!
610
-
611
- Shout out to the **Level1Techs** crew, community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs), and for providing big hardware expertise and access to run these experiments!!!
612
-
613
- Finally, I'm still learning the ropes, so please be patient and we can learn together. Thanks!
614
-
615
  ## References
616
  * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)
617
  * [ik_llama.cpp Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
 
4
  base_model: deepseek-ai/DeepSeek-V3-0324
5
  license: mit
6
  base_model_relation: quantized
7
+ tags:
8
+ - mla
9
+ - imatrix
10
+ - deepseek_v3
11
+ - conversational
12
  ---
13
 
14
+ ## `ik_llama.cpp` imatrix MLA Quantizations of DeepSeek-V3-0324
15
 
16
+ This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support the advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
17
 
18
+ These quants provide best-in-class perplexity for the given memory footprint. MLA support allows 32k+ context length in under 24GB GPU VRAM for `R1` and `V3` while offloading MoE layers to RAM.
19
 
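+ A rough back-of-the-envelope check (my own numbers, treat as an approximation): DeepSeek-V3 has 61 layers and MLA caches roughly 512 latent + 64 RoPE values per token per layer, so even a 32k `q8_0` cache is only on the order of a GiB:
+ 
+ ```bash
+ # Approximate MLA KV cache size at 32k context
+ # (assumes 61 layers, 512+64 cached values per token per layer, ~1 byte per value at q8_0)
+ echo "$(( 61 * (512 + 64) * 32768 / 1024 / 1024 )) MiB"   # => ~1098 MiB
+ ```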
20
+ Perfect for CPU+GPU systems with 24GB+ VRAM, and also CPU *only* rigs using dynamic quant repacking (for maximum memory throughput).
21
+
22
+ You could try `ik_llama.cpp` quickly with your *existing* quants, as it computes MLA tensors and repacks quants on the fly at startup (if you have enough RAM+VRAM to fit the entire model). Then come check out these fat quants here once you see the difference.
23
+
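+ For example (a minimal sketch, not a tuned command: the model path is a placeholder and the build/offload flags may differ on your setup, so check the Getting Started guide linked below):
+ 
+ ```bash
+ # Build the ik_llama.cpp fork (CUDA flag name assumed; adjust for your backend)
+ git clone https://github.com/ikawrakow/ik_llama.cpp
+ cd ik_llama.cpp
+ cmake -B build -DGGML_CUDA=ON
+ cmake --build build --config Release -j $(nproc)
+ 
+ # Point llama-server at a GGUF you already have; MLA tensors are computed at startup
+ ./build/bin/llama-server \
+     --model /path/to/your/existing/DeepSeek-V3-0324-quant.gguf \
+     -mla 2 -fa -fmoe -amb 512 \
+     -ctk q8_0 --ctx-size 32768 \
+     --n-gpu-layers 63 --override-tensor exps=CPU
+ ```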
24
+ ## Big Thanks
25
+ Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
26
+
27
+ Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for tips and tricks helping each other run all the fun new models!
28
+
29
+ Excited to share and learn together. Thanks!
30
+
31
+ ## Quant Collection
32
+ So far these are my best recipes, offering the lowest perplexity per GiB and suiting a wide variety of CPU+GPU or CPU *only* rigs.
33
+
34
+ #### `IQ4_K_R4` 4.936 BPW
35
+ Special mix `IQ5_K_R4`/`IQ4_K_R4` routed experts with all other layers full `q8_0` for CPU+GPU offload or `--run-time-repack` for max speed CPU *only* rigs.
36
+ Great for a big 384+ GB RAM rig with a 24GB+ GPU.
37
+
38
+ #### `IQ2_K_R4` 2.889 BPW
39
+ Special mix `IQ3_K_R4`/`IQ2_K_R4` routed experts with all other layers full `q8_0` for CPU+GPU offload or `--run-time-repack` for max speed CPU *only* rigs.
40
+ Great for CPU+GPU "troll rig" high-end gamer systems, e.g. a 9950X with 96 GB RAM, a 3090TI with 24 GB VRAM, and a Gen 5 NVMe SSD.
41
+
42
+ #### Custom Mixes
43
+ If you have multiple GPUs and more VRAM, you can make custom quants to optimize size and quant types for whatever hardware you have. If you have less VRAM, you could make a custom quant that is leaner in the non-routed-expert layers, or get 64k+ context into 24GB VRAM. You can also use the offline repack tool if you want to run CPU only with `mmap()` still enabled.
44
+
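+ As one hedged illustration (the second GPU name `CUDA1`, the layer regex, the override order, and the model path are assumptions for your own layout, not a tested recipe), extra VRAM can also be used at runtime by keeping a few routed-expert layers off the CPU:
+ 
+ ```bash
+ # Keep routed experts of layers 3-9 on a second GPU; remaining routed experts go to CPU
+ # (the first matching --override-tensor rule should win; verify with --help on your build)
+ CUDA_VISIBLE_DEVICES="0,1" \
+ ./build/bin/llama-server \
+     --model /path/to/DeepSeek-V3-0324-IQ2_K_R4.gguf \
+     -mla 2 -fa -fmoe -amb 512 -ctk q8_0 --ctx-size 32768 \
+     --n-gpu-layers 63 \
+     --override-tensor "blk\.[3-9]\.ffn_.*_exps\.weight=CUDA1" \
+     --override-tensor exps=CPU \
+     --threads 16 --host 127.0.0.1 --port 8080
+ ```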
45
+ ## Quick Start
46
+ #### `ik_llama.cpp` API server for GPU+CPU
47
+ ```bash
48
+ # Fits 32k context in under 24GB VRAM
49
+ # Optional `-ser 6,1` improves speed at minimal cost to quality
50
+ CUDA_VISIBLE_DEVICES="0," \
51
+ ./build/bin/llama-server \
52
+ --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
53
+ --alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
54
+ --ctx-size 32768 \
55
+ -ctk q8_0 \
56
+ -mla 2 -fa \
57
+ -amb 512 \
58
+ -fmoe \
59
+ --temp 0.3 \
60
+ --min-p 0.05 \
61
+ --n-gpu-layers 63 \
62
+ --override-tensor exps=CPU \
63
+ --parallel 1 \
64
+ --threads 16 \
65
+ --host 127.0.0.1 \
66
+ --port 8080
67
+ ```
68
+
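+ Once the server is up, a quick sanity check against the OpenAI-compatible chat endpoint looks like this (standard upstream `llama-server` API; prompt and sampling values are just examples):
+ 
+ ```bash
+ curl -s http://127.0.0.1:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+         "model": "ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4",
+         "temperature": 0.3,
+         "messages": [{"role": "user", "content": "Hello, who are you?"}]
+       }'
+ ```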
69
+ #### `ik_llama.cpp` API server for CPU *only*
70
+ ```bash
71
+ # The goal for now is as much RAM bandwidth in a single NUMA node e.g.
72
+ # Use BIOS `NPS0` on AMD Epyc or single socket of Intel Xeon in BIOS `SNC=Disable`
73
+ # Tune your `--threads` for token generation, and `--threads-batch` for prompt processing (prefill)
74
+ # Note `--run-time-repack` will pre-allocate enough RAM for model weights instead of mmap()'ing off disk
75
+ # Note there are options for both Explicit and Transparent Huge Pages with tuning discussions in [git repo](https://github.com/ikawrakow/ik_llama.cpp/pull/278#issuecomment-2746381515)
76
+ numactl -N 0 -m 0 \
77
+ ./build/bin/llama-server \
78
+ --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
79
+ --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
80
+ --run-time-repack \
81
+ --ctx-size 65536 \
82
+ -ctk q8_0 \
83
+ -mla 3 -fa \
84
+ -amb 512 \
85
+ -fmoe \
86
+ --temp 0.3 \
87
+ --min-p 0.05 \
88
+ --parallel 1 \
89
+ --threads 88 \
90
+ --threads-batch 128 \
91
+ --numa numactl \
92
+ --host 127.0.0.1 \
93
+ --port 8080
94
+ ```
95
+
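+ Before pinning threads, it can help to confirm what a single NUMA node actually looks like on your box (plain Linux tools, nothing `ik_llama.cpp` specific):
+ 
+ ```bash
+ # Show NUMA nodes, their CPUs, and per-node memory before picking -N/-m and thread counts
+ numactl --hardware
+ lscpu | grep -i numa
+ ```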
96
+ ## Quant Comparisons
97
+
98
+ These are probably the **best quants available in this size class** for `V3-0324`!
99
+
100
+ ![Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`](benchmarks-01.png "Benchmarks showing these quants are smaller in size yet similar in performance to the Q8_0")
101
+
102
+ ubergarm made no sacrifices for token embedding, attention, dense
+ layers, or shared experts. This is possible because the `ik_llama.cpp`
+ MLA implementation saves so much GPU VRAM, enabling 32k context in
+ under 24GB VRAM. These quants also use a new high-quality imatrix
+ that includes various coding samples and multiple written languages.
+ Routed expert layers use SotA CPU `IQx_K_R4` non-linear quants as
+ well, for likely the best perplexity per GiB.
109
+
110
+ bartowski uses full token embedding quality but lower attention, dense
111
+ layers, and shared expert quants. He does use a good quality imatrix with
112
+ perplexity performance within the measurement error relative to this one.
113
+
114
+ unsloth sacrifices token embedding quality, uses middle-quality attention
+ and dense layers, and applies no importance matrix.
116
+
117
+ mradermacher's model card side-bar is not showing, so I haven't yet fully
+ compared the exact recipe. I'm working with them to get info on their split GGUFs.
119
+
120
+ #### Comparison Details
121
+
122
+ <details>
123
+
124
+ <summary>Detailed Comparison of ~Q2 Class Quants</summary>
125
+
126
+ | | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
127
+ | --- | --- | --- | --- | --- |
128
+ | **Overview** | | | | |
129
+ | `split.tensors.count` | 1147 | 1025 | 1025 | |
130
+ | `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | |
131
+ | File Size (GiB) | 227 | 228 | 231 | |
132
+ | **Multi-Head Latent Attention** | | | | |
133
+ | `blk.*.attn_kv_b.weight` | `Q8_0` | n/a | n/a | n/a |
134
+ | `blk.*.attn_k_b.weight` | `Q8_0` | n/a | n/a | n/a |
135
+ | `blk.*.attn_v_b.weight` | `Q8_0` | n/a | n/a | n/a |
136
+ | **Dense Layers** | | | | |
137
+ | `blk.[0-2].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
138
+ | `blk.[0-2].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
139
+ | `blk.[0-2].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
140
+ | `blk.[0-2].attn_norm.weight` | `F32` | `F32` | `F32` | |
141
+ | `blk.[0-2].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
142
+ | `blk.[0-2].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
143
+ | `blk.[0-2].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
144
+ | `blk.[0-2].ffn_down.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
145
+ | `blk.[0-2].ffn_gate.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
146
+ | `blk.[0-2].ffn_norm.weight` | `F32` | `F32` | `F32` | |
147
+ | `blk.[0-2].ffn_up.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
148
+ | `blk.[0-2].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
149
+ | **Shared & Routed MoE Layers** | | | | |
150
+ | `blk.[3-60].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
151
+ | `blk.[3-60].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
152
+ | `blk.[3-60].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
153
+ | `blk.[3-60].attn_norm.weight` | `F32` | `F32` | `F32` | |
154
+ | `blk.[3-60].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
155
+ | `blk.[3-60].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
156
+ | `blk.[3-60].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
157
+ | `blk.[3-60].exp_probs_b.bias` | `F32` | `F32` | `F32` | |
158
+ | `blk.[3-60].ffn_down_exps.weight` | `IQ3_K_R4` | `Q3_K` | `Q3_K` | |
159
+ | `blk.[3-60].ffn_down_shexp.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
160
+ | `blk.[3-60].ffn_gate_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
161
+ | `blk.[3-60].ffn_gate_inp.weight` | `F32` | `F32` | `F32` | |
162
+ | `blk.[3-60].ffn_gate_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
163
+ | `blk.[3-60].ffn_norm.weight` | `F32` | `F32` | `F32` | |
164
+ | `blk.[3-60].ffn_up_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
165
+ | `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
166
+ | `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
167
+ | **Importance Matrix & Perplexity** | | | | |
168
+ | `imatrix.dataset` | `calibration_data_v5_rc.txt`| `calibration_datav3.txt` | n/a | ? |
169
+ | Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | ? | ? | ? |
170
+
171
+
172
+ </details>
173
+
174
+ #### imatrix
175
 
176
  <details>
177
 
 
218
 
219
  </details>
220
 
221
+ #### Quant Cookers Secret Recipe
222
 
223
  ```bash
224
  #!/usr/bin/env bash
225
 
226
  custom="
227
+ # Token embedding (GPU)
228
+ # NOTE: cannot be a repacked type due to tensor size
229
  token_embd\.weight=q8_0
230
+ # output tensors (GPU)
231
  output\.weight=q8_0
232
  output_norm\.weight=q8_0
233
 
 
235
  blk\.[0-2]\..*=q8_0
236
 
237
  # All attention, weights, and bias tensors for MoE layers (3-60) (GPU)
238
+ # NOTE: attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128
239
  blk\.[3-9]\.attn_.*=q8_0
240
  blk\.[1-5][0-9]\.attn_.*=q8_0
241
  blk\.60\.attn_.*=q8_0
 
257
  blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
258
  blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0
259
 
260
+ # Routed Experts (3-60) (CPU)
261
+ # NOTE: Traditional wisdom suggests earlier layers use higher quants
262
  blk\.[3-9]\.ffn_down_exps\.weight=iq3_k_r4
263
  blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_k_r4
264
  blk\.60\.ffn_down_exps\.weight=iq3_k_r4
 
284
  24
285
  ```
286
 
287
+ #### Perplexity
 
 
288
 
289
  ```bash
290
  $ CUDA_VISIBLE_DEVICES="0," \
 
701
  Final estimate: PPL = 3.5614 +/- 0.02001
702
  ```
703
 
704
+ #### Split
 
 
705
 
706
  ```bash
707
+ $ ./build/bin/llama-gguf-split \
708
  --dry-run \
709
  --split \
710
  --split-max-size 50G \
 
714
 
715
  </details>
716
 
717
  ## References
718
  * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)
719
  * [ik_llama.cpp Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
benchmarks-01.png ADDED

Git LFS Details

  • SHA256: 4e2a26bfcf183ad354822b6a82beba4bd67d6f991c1aa3e2775e15552f2fea57
  • Pointer size: 131 Bytes
  • Size of remote file: 239 kB