Release DeepSeek-V3-0324-IQ4_K_R4 and benchmarks
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00002-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00003-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00004-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00005-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00006-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00007-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00008-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00009-of-00010.gguf +3 -0
- DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00010-of-00010.gguf +3 -0
- README.md +173 -71
- benchmarks-01.png +3 -0

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:eb8ef08b44a99223040cb02d2f89764eb03662669a65c690da670a3770521f57
+size 41169676352

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00002-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7ddbd081cbdad380bb4548c81a2fc43a7f405d306f29678dfa1283b998c0ff3f
+size 42494252256

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00003-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:74789329b86ab85418f361e0e167c627ff94b0c12d27a1acd75823120c6b82e4
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00004-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7c2840f878709701a655caca5ee86952293cf00137677065582eed49595491a4
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00005-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a07cb7b0c4d8693fce701d08e9ec4cb2e693273279ba39fd17c3a1755439e81c
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00006-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:18483856dcc014e7aa32c55b641695ff05095822b86c05c87d901f9d1b3dfee2
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00007-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7c925b58c8394d1e965c930e2f6c415b0ea28cefb4bf6c383575f5e27d60c89a
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00008-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:520fecd53d32111018cd13c235d5731c737865497560726c4d253804476516ae
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00009-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8eba7dada84aad746661978ef4edcd6cf6b12d5a2cb27840d52d49dfeb89d882
+size 42494252288

DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00010-of-00010.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1af85ec57870ca34ea61152fd6ee2697bd8a3265006c8965ce80b12904ab1b46
+size 33542014112
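
The LFS pointers above record the sha256 of each shard, so you can sanity-check a download before spending time on a failed load. A minimal sketch using shard 00001's oid from above:

```bash
# Verify a downloaded shard against the sha256 recorded in its LFS pointer
expected="eb8ef08b44a99223040cb02d2f89764eb03662669a65c690da670a3770521f57"
actual="$(sha256sum DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf | awk '{print $1}')"
[ "$actual" = "$expected" ] && echo "shard 00001 OK" || echo "shard 00001 MISMATCH" >&2
```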

README.md
CHANGED

@@ -4,15 +4,174 @@ pipeline_tag: text-generation
-## `ik_llma.cpp` imatrix MLA Quantizations of DeepSeek-V3-0324
-This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do **not** download
-These quants provide

@@ -59,33 +218,16 @@ Final estimate: PPL = 3.4755 +/- 0.03305
-#### `IQ2_K_R4`
-Hybrid `IQ2_K_R4` non-linear quant for 32k context using `q8_0` MLA in for CPU+GPU offload with 96+GB RAM and 24+GB VRAM with minimal perplexity.
-<details>
-<summary>`IQ2_K_R4` Details Here</summary>
-```bash
-$ git branch
-* ik/make_qx_quants
-$ git rev-parse --short HEAD
-b9c25fe7
-```
----
-## Quantize Script
-# Token embedding

@@ -93,6 +235,7 @@ output_norm\.weight=q8_0

@@ -114,7 +257,8 @@ blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0
-#

@@ -140,9 +284,7 @@ custom=$(
-## Perplexity

@@ -559,12 +701,10 @@ llama_print_timings: total time = 2841519.57 ms / 287233 tokens
-## Split
-$ ./build/bin/llama-gguf-split

@@ -574,44 +714,6 @@ $ ./build/bin/llama-gguf-split
-#### `TODO`
-- [ ] Upload good CPU *only* optimized inferencing quant
-## `ik_llama.cpp` API server
-```bash
-# I think temperature "1.0" on the API is 0.3 in llama.cpp ????
-# https://api-docs.deepseek.com/quick_start/parameter_settings
-# https://www.reddit.com/r/LocalLLaMA/comments/1jk0qjs/178bit_deepseekv30324_230gb_unsloth_dynamic_gguf/
-# Uses just under 24GB VRAM
-CUDA_VISIBLE_DEVICES="0," \
-./build/bin/llama-server \
-    --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
-    --alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
-    --ctx-size 32768 \
-    -ctk q8_0 \
-    -mla 2 -fa \
-    -amb 512 \
-    -fmoe \
-    --min-p 0.01 \
-    --temp 0.0 \
-    --n-gpu-layers 63 \
-    --override-tensor exps=CPU \
-    --parallel 1 \
-    --threads 16 \
-    --host 127.0.0.1 \
-    --port 8080
-```
-## Big Thanks
-Big thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for sharing tips and tricks to help each other access all the fun new models!
-Shout out to the **Level1Techs** crew, community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs), and for providing big hardware expertise and access to run these experiments!!!
-Finally, I'm still learning the ropes, so please be patient and we can learn together. Thanks!

base_model: deepseek-ai/DeepSeek-V3-0324
license: mit
base_model_relation: quantized
tags:
- mla
- imatrix
- deepseek_v3
- conversational
---

## `ik_llama.cpp` imatrix MLA Quantizations of DeepSeek-V3-0324

This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support advanced non-linear SotA quants and Multi-Head Latent Attention (MLA). Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

These quants provide best-in-class perplexity for the given memory footprint. MLA support allows 32k+ context length in under 24GB of GPU VRAM for `R1` and `V3` while offloading MoE layers to RAM.

Perfect for CPU+GPU systems with 24GB+ VRAM, and also for CPU *only* rigs using dynamic quant repacking for maximum memory throughput.

You could try `ik_llama.cpp` quickly with your *existing* quants, as it computes MLA tensors and repacks quants on the fly at startup (provided you have enough RAM+VRAM to fit the entire model). Then come check out these fat quants here once you see the difference.
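
If you want to kick the tires before downloading hundreds of GiB, a rough build sketch is below. It assumes the fork builds like upstream llama.cpp with CMake and that you want CUDA support; flags may differ on your checkout, so cross-check the Getting Started guide linked under References.

```bash
# Rough build sketch (assumptions: upstream-style CMake options, CUDA toolkit installed)
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON      # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j "$(nproc)"
# The binaries referenced throughout this card (llama-server, llama-quantize,
# llama-perplexity, llama-gguf-split) land in ./build/bin/
```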

## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for the tips and tricks that help everyone run all the fun new models!

Excited to share and learn together. Thanks!

## Quant Collection
So far these are my best recipes, offering the lowest perplexity per GiB and suitable for a wide variety of CPU+GPU or CPU *only* rigs.

#### `IQ4_K_R4` 4.936 BPW
Special mix of `IQ5_K_R4`/`IQ4_K_R4` routed experts with all other layers at full `q8_0`, for CPU+GPU offload or with `--run-time-repack` for max-speed CPU *only* rigs.
Great for a big 384+ GB RAM rig with a 24GB+ GPU.

#### `IQ2_K_R4` 2.889 BPW
Special mix of `IQ3_K_R4`/`IQ2_K_R4` routed experts with all other layers at full `q8_0`, for CPU+GPU offload or with `--run-time-repack` for max-speed CPU *only* rigs.
Great for CPU+GPU "troll rig" high-end gamer systems, e.g. 9950X, 96 GB RAM, 3090 Ti 24 GB VRAM, and a Gen 5 NVMe SSD.
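
To grab just one of these mixes without cloning the whole repository, the Hugging Face CLI can filter by folder. The repo id below matches the links used in the comparison table further down; double-check `huggingface-cli download --help` if the flags have changed on your version.

```bash
# Download only the IQ4_K_R4 shards into a local directory
pip install -U "huggingface_hub[cli]"
huggingface-cli download ubergarm/DeepSeek-V3-0324-GGUF \
    --include "DeepSeek-V3-0324-IQ4_K_R4/*" \
    --local-dir ./DeepSeek-V3-0324-GGUF
```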

#### Custom Mixes
If you have multiple GPUs and more VRAM, you can make custom quants to optimize size and quality for whatever hardware you have; see the sketch below. If you have less VRAM, you could make a custom quant that is leaner in the non-routed-expert layers, or fit 64k+ context in 24GB VRAM. You can also use the offline repack tool if you want to run CPU *only* with `mmap()` still enabled.
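
As a concrete sketch of that idea: the same `--override-tensor` mechanism used in the Quick Start below to pin routed experts to CPU can also pin a handful of expert layers onto a second GPU. The layer range, buffer names, and VRAM split here are illustrative assumptions for a hypothetical dual-24GB setup, not a tested recipe; rule order can matter, so the more specific pattern goes first.

```bash
# Hypothetical dual-GPU sketch: a few routed-expert layers on the second GPU,
# the rest of the experts on CPU (same idea as `--override-tensor exps=CPU` below)
CUDA_VISIBLE_DEVICES="0,1" \
./build/bin/llama-server \
    --model ./DeepSeek-V3-0324-IQ2_K_R4.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 2 -fa -fmoe \
    --n-gpu-layers 63 \
    --override-tensor "blk\.(3|4|5|6)\.ffn_.*_exps\.weight=CUDA1" \
    --override-tensor exps=CPU
```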

## Quick Start
#### `ik_llama.cpp` API server for GPU+CPU
```bash
# Fits 32k context in under 24GB VRAM
# Optional `-ser 6,1` improves speed at minimal cost to quality
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
    --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
    --alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8080
```
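
Once the server is up it exposes the usual llama.cpp-style OpenAI-compatible HTTP endpoints (an assumption that the fork has not diverged here), so a quick smoke test from another terminal looks like this:

```bash
# Minimal smoke test against the server started above
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4",
          "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
          "temperature": 0.3
        }'
```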

#### `ik_llama.cpp` API server for CPU *only*
```bash
# The goal for now is as much RAM bandwidth in a single NUMA node as possible, e.g.
# use BIOS `NPS0` on AMD Epyc, or a single socket of an Intel Xeon with BIOS `SNC=Disable`.
# Tune `--threads` for token generation and `--threads-batch` for prompt processing (prefill).
# Note `--run-time-repack` pre-allocates enough RAM for the model weights instead of mmap()'ing off disk.
# There are options for both Explicit and Transparent Huge Pages, with tuning discussion at
# https://github.com/ikawrakow/ik_llama.cpp/pull/278#issuecomment-2746381515
numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
    --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
    --run-time-repack \
    --ctx-size 65536 \
    -ctk q8_0 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    --temp 0.3 \
    --min-p 0.05 \
    --parallel 1 \
    --threads 88 \
    --threads-batch 128 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080
```
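
Before launching, it is worth confirming that the box really presents a single NUMA node and checking the current huge-page policy; these are stock Linux commands, shown only as a sanity-check sketch:

```bash
# Confirm the NUMA layout the BIOS settings above are meant to produce
numactl --hardware        # ideally one node (NPS0 / SNC disabled)
lscpu | grep -i numa

# Check the current Transparent Huge Pages policy before experimenting
cat /sys/kernel/mm/transparent_hugepage/enabled
```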

## Quant Comparisons

These are probably the **best quants available in this size class** for `V3-0324`!

![Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`](benchmarks-01.png "Benchmarks showing these quants are smaller in size yet similar in performance to the `Q8_0`")

ubergarm made no sacrifices for token embedding, attention, dense layers, or shared experts. This is possible because the `ik_llama.cpp` MLA implementation saves so much GPU VRAM, enabling 32k context in under 24GB VRAM. These quants also use a new high-quality imatrix that includes various coding samples and multiple written languages. Routed expert layers use SotA CPU `IQx_K_R4` non-linear quants as well, for likely the best perplexity per GiB.

bartowski uses full token embedding quality but lower attention, dense layer, and shared expert quants. He does use a good quality imatrix, with perplexity performance within measurement error relative to this one.

unsloth sacrifices token embedding quality, with middle-quality attention and dense layers and no importance matrix.

mradermacher's model card sidebar is not showing, so I haven't yet fully compared the exact recipe. Working with them to get info on their split GGUFs.
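
If you want to reproduce this per-tensor comparison yourself, the upstream `gguf` Python package ships a dump utility that prints every tensor name and its quantization type. Upstream tooling may not recognize this fork's `_R4` types, so the sketch below targets one of the mainline-format quants from the table (filename taken from the bartowski link); for the `ik` quants, the Hugging Face `show_file_info` links in the table expose the same data.

```bash
# Inspect per-tensor quant types of a mainline-format GGUF and spot-check a few groups
pip install -U gguf
gguf-dump deepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf \
    | grep -E "token_embd|attn_kv_b|ffn_down_exps"
```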

#### Comparison Details

<details>

<summary>Detailed Comparison of ~Q2 Class Quants</summary>

| | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
| --- | --- | --- | --- | --- |
| **Overview** | | | | |
| `split.tensors.count` | 1147 | 1025 | 1025 | |
| `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | |
| File Size (GiB) | 227 | 228 | 231 | |
| **Multi-Head Latent Attention** | | | | |
| `blk.*.attn_kv_b.weight` | `Q8_0` | n/a | n/a | n/a |
| `blk.*.attn_k_b.weight` | `Q8_0` | n/a | n/a | n/a |
| `blk.*.attn_v_b.weight` | `Q8_0` | n/a | n/a | n/a |
| **Dense Layers** | | | | |
| `blk.[0-2].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[0-2].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[0-2].attn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].ffn_down.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
| `blk.[0-2].ffn_gate.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].ffn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].ffn_up.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
| **Shared & Routed MoE Layers** | | | | |
| `blk.[3-60].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[3-60].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[3-60].attn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].exp_probs_b.bias` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_down_exps.weight` | `IQ3_K_R4` | `Q3_K` | `Q3_K` | |
| `blk.[3-60].ffn_down_shexp.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
| `blk.[3-60].ffn_gate_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
| `blk.[3-60].ffn_gate_inp.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_gate_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].ffn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_up_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
| `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
| **Importance Matrix & Perplexity** | | | | |
| `imatrix.dataset` | `calibration_data_v5_rc.txt` | `calibration_datav3.txt` | n/a | ? |
| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | ? | ? | ? |

</details>

#### imatrix

<details>

[...]

</details>

#### Quant Cookers Secret Recipe

```bash
#!/usr/bin/env bash

custom="
# Token embedding (GPU)
# NOTE: cannot be a repacked type due to tensor size
token_embd\.weight=q8_0

# output tensors (GPU)
output\.weight=q8_0
output_norm\.weight=q8_0

# ...
blk\.[0-2]\..*=q8_0

# All attention, weights, and bias tensors for MoE layers (3-60) (GPU)
# NOTE: attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128
blk\.[3-9]\.attn_.*=q8_0
blk\.[1-5][0-9]\.attn_.*=q8_0
blk\.60\.attn_.*=q8_0

# ...
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts (3-60) (CPU)
# NOTE: Traditional wisdom suggests earlier layers use higher quants
blk\.[3-9]\.ffn_down_exps\.weight=iq3_k_r4
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_k_r4
blk\.60\.ffn_down_exps\.weight=iq3_k_r4

# ...
24
```
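
For completeness, a rule list like the one above is normally passed to the fork's quantize tool together with the imatrix. The invocation below is a sketch from memory, not copied from this card: the `--custom-q` flag spelling, the paths, and the final thread-count argument are assumptions, so confirm against `./build/bin/llama-quantize --help` on your build.

```bash
# Hypothetical quantize invocation; flag names and paths are assumptions
./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/DeepSeek-V3-0324-bf16.gguf \
    /path/to/DeepSeek-V3-0324-IQ4_K_R4.gguf \
    IQ4_K_R4 \
    24
```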

#### Perplexity

```bash
$ CUDA_VISIBLE_DEVICES="0," \
# ...
Final estimate: PPL = 3.5614 +/- 0.02001
```

#### Split

```bash
$ ./build/bin/llama-gguf-split \
    --dry-run \
    --split \
    --split-max-size 50G \
# ...
```

</details>
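
The shards do not need to be recombined before use: pointing `--model` at the first `-00001-of-00010.gguf` file is normally enough, since the loader picks up the sibling shards automatically. If you do want a single file back, `llama-gguf-split` also has a merge mode (output path below is illustrative):

```bash
# Optional: recombine the shards into a single GGUF
./build/bin/llama-gguf-split --merge \
    DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    DeepSeek-V3-0324-IQ4_K_R4-merged.gguf
```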

## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)
* [ik_llama.cpp Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)

benchmarks-01.png
ADDED