ubergarm committed
Commit 821178b · 0 Parent(s)

initial commit
Files changed (2):
  1. .gitattributes +38 -0
  2. README.md +127 -0
.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.dat filter=lfs diff=lfs merge=lfs -text
*.gguf filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED
---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: Qwen/Qwen3-30B-A3B
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE
base_model_relation: quantized
tags:
- imatrix
- qwen3_moe
- conversational
- ik_llama.cpp
---

## `ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-30B-A3B

This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support its advanced non-linear SotA quants. Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

These quants provide best-in-class quality for the given memory footprint.

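If you do not already have the fork built, a minimal CUDA build sketch follows; the exact CMake flags are an assumption on my part, so treat the [Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) in the References as authoritative:

```bash
# Hedged build sketch -- flags may differ per system/backend; see the
# ik_llama.cpp Getting Started Guide for the authoritative steps.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```
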
## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, their community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for tips and tricks helping each other run all the fun new models!

Excited to share and learn together. Thanks!

## Quant Collection
So far these are my best recipes, offering great quality at useful memory-footprint breakpoints.

#### ubergarm/Qwen3-30B-A3B-mix-IQ4_K
This quant provides best-in-class quality while keeping good speed. It is designed to run with over 32k context using the GPU-performant f16 KV-Cache in under 24GB of VRAM. You can also try offloading the KV-Cache to CPU with `-nkvo -ctk q8_0 -ctv q8_0` and using `-rtr` for RAM-optimized tensor packing on startup (which disables `mmap()`), taking ~18396MiB of VRAM, or less by offloading repeating layers to CPU as well at decreased speed; see the lower-VRAM sketch in the Quick Start section below.

```
17.679 GiB (4.974 BPW)

f32: 241 tensors
q8_0: 6 tensors
iq4_k: 96 tensors
iq5_k: 48 tensors
iq6_k: 188 tensors

Final estimate: PPL = 9.1184 +/- 0.07278 (wiki.test.raw, compare to BF16 at 9.0703 +/- 0.07223)
*NOTE*: Benchmarks including PPL with `wiki.test.raw` and KLD with `ubergarm-kld-test-corpus.txt` are looking interesting! Will publish soon!
```

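The `Final estimate: PPL` line above is standard perplexity-tool output; a minimal sketch of how such a figure is measured, assuming a local copy of `wiki.test.raw` (the paths, context size, and other flags behind the published number are not shown here and are my assumptions):

```bash
# Hedged perplexity sketch -- paths are placeholders and the flags are
# assumptions, not the exact invocation behind the published figure.
./build/bin/llama-perplexity \
    --model /path/to/Qwen3-30B-A3B-mix-IQ4_K.gguf \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    --threads 1
```
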
## Quick Start
#### `ik_llama.cpp` API server for hybrid GPU+CPU inferencing
```bash
# This example uses ~21468MiB of VRAM
./build/bin/llama-server \
    --model ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf \
    --alias ubergarm/Qwen3-30B-A3B-mix-IQ4_K \
    -fa \
    -ctk f16 -ctv f16 \
    -c 32768 \
    -fmoe \
    -ngl 99 \
    --threads 1 \
    --host 127.0.0.1 \
    --port 8080
```

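Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (assuming the fork keeps mainline `llama-server`'s `/v1/chat/completions` route) might look like:

```bash
# Hedged smoke test -- adjust host/port to match the server flags above.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/Qwen3-30B-A3B-mix-IQ4_K",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128
      }'
```
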
If you want more context and/or less VRAM usage, you can try:
* Smaller KV-Cache quantization, e.g. `-ctk q4_0 -ctv q4_0` (see the sketch below)

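Combining that with the `-nkvo`/`-rtr` options mentioned in the Quant Collection section, a lower-VRAM variant of the server command might look roughly like this; it is a sketch rather than a tuned configuration, and the thread count in particular is a guess:

```bash
# Hedged lower-VRAM sketch: keep a quantized KV-Cache on CPU (-nkvo) and
# repack tensors at load time (-rtr, disables mmap); expect lower speed.
./build/bin/llama-server \
    --model ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf \
    --alias ubergarm/Qwen3-30B-A3B-mix-IQ4_K \
    -fa \
    -nkvo -ctk q8_0 -ctv q8_0 \
    -rtr \
    -c 32768 \
    -fmoe \
    -ngl 99 \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080
```
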
If you want more throughput you could try:
* Increase context to the max limit for your VRAM
* Use `--parallel N` to have (context / N) tokens available per slot
* Use an asyncio-style client and keep the request queue full (see the sketch after this list)

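For example, a throughput-oriented launch might raise `--parallel` and then keep every slot busy with concurrent requests; the slot count, endpoint, and sleep below are assumptions to adapt, not a tuned recipe:

```bash
# Hedged throughput sketch: 4 slots share the 32k context (~8k each);
# several concurrent requests keep the queue full.
./build/bin/llama-server \
    --model ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf \
    -fa -fmoe -ngl 99 \
    -c 32768 \
    --parallel 4 \
    --host 127.0.0.1 --port 8080 &

sleep 60  # crude wait for the model to finish loading; adjust as needed

for i in $(seq 1 8); do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Request $i: say hi\"}],\"max_tokens\":64}" &
done
wait
```
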
## Quantization
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
# Attention (give Layer 0 a little extra as it scores lowest on cosine-similarity score)
blk\.0\.attn_k.*=q8_0
blk\.0\.attn_q.*=q8_0
blk\.0\.attn_v.*=q8_0
blk\.0\.attn_output.*=q8_0

blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq6_k

# Token Embedding (put these second so they don't catch the attn_output tensors too early)
token_embd\.weight=q8_0
output\.weight=q8_0

# Experts
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
"

# Drop the comment lines and join the rest into one comma-separated list
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/imatrix-Qwen3-30B-A3B.dat \
    /mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf \
    /mnt/raid/models/ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf \
    IQ4_K \
    24
```

</details>

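The recipe above consumes a pre-computed importance matrix; a hedged sketch of how such a `.dat` file is typically generated follows, where the paths, context size, and thread count are placeholders rather than the exact invocation used for this repo:

```bash
# Hedged imatrix sketch: run the calibration text (linked in References)
# through the BF16 model; flags here are assumptions, not the real run.
./build/bin/llama-imatrix \
    --model /mnt/raid/models/Qwen/Qwen3-30B-A3B/Qwen3-30B-A3B-BF16-00001-of-00002.gguf \
    -f calibration_data_v5_rc.txt \
    -o imatrix-Qwen3-30B-A3B.dat \
    --ctx-size 512 \
    --threads 24
```
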
## Discussion
*TODO*: Discuss comparisons with other quants, e.g. bartowski, unsloth, and mradermacher, covering both "quality" and "speed".

## Benchmarks
In first tests with `llama-sweep-bench` I'm seeing over 1600 tok/sec PP and 105 tok/sec TG on my 3090TI FE with 24GB VRAM. It does slow down, of course, as it gets deeper into the full 32k context. Check the linked Benchmarks Discussion for updates, as this is all pretty fresh right now. Pretty amazing performance, both in terms of generation quality and speed, for a model this size!

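`llama-sweep-bench` ships with the ik_llama.cpp tree; a rough sketch of an invocation mirroring the server flags above is shown below, but the exact flags it accepts are an assumption on my part, so check the Benchmarks Discussion link for real command lines:

```bash
# Hedged benchmark sketch -- assumes llama-sweep-bench takes the same
# common flags as llama-server; see the Benchmarks Discussion for the
# actual invocations behind the numbers quoted above.
./build/bin/llama-sweep-bench \
    --model ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf \
    -fa -fmoe \
    -c 32768 \
    -ngl 99 \
    --threads 1
```
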
## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)
* [ik_llama.cpp Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [ik_llama.cpp Benchmarks Discussion](https://github.com/ikawrakow/ik_llama.cpp/discussions/357)
* [imatrix calibration_data_v5_rc.txt](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c#file-calibration_data_v5_rc-txt)