# Tensor Type Testing

Skip to the bottom of this document for a TL;DR.

For more info, see llama.cpp #12511: Handle user-defined quantization levels for additional tensors by @EAddario.

Testing done by @ddh0 using this branch as of commit 5a304b8, with libllama built for Linux CUDA.
## Quantization naming scheme

```
Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
```

For example, `Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf` means:

- Model is Llama 3.1 8B Instruct
- TYPE_EMBD (token embeddings) are in Q4_K
- TYPE_FFN (MLP / feed-forward tensors) are in Q4_K
- TYPE_ATTN (K, Q, V attention and attention output tensors) are in Q8_0
- TYPE_OUTPUT (output tensor) is in Q8_0
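The naming scheme above can be expressed as a small helper function (hypothetical, for illustration only — not part of the original tooling):

```python
def quant_name(model: str, embd: str, ffn: str, attn: str, output: str) -> str:
    """Build a GGUF filename following the E/F/A/O naming scheme."""
    return f"{model}-E{embd}-F{ffn}-A{attn}-O{output}.gguf"

print(quant_name("Llama-3.1-8B-Instruct", "Q4_K", "Q4_K", "Q8_0", "Q8_0"))
# → Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf
```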
## Command template

```shell
TYPE_EMBD=GGML_TYPE
TYPE_FFN=GGML_TYPE
TYPE_ATTN=GGML_TYPE
TYPE_OUTPUT=GGML_TYPE
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/quant.gguf
N_THREADS=4

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

Note that the trailing positional `$TYPE_FFN` is the base quantization type passed to `llama-quantize`; the `--token-embedding-type`, `--tensor-type`, and `--output-tensor-type` flags then override specific tensor groups.
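Since the same invocation is repeated below with only the type settings changing, the argument list could also be generated programmatically. A sketch (the `quantize_cmd` helper and its defaults are assumptions, not part of the original harness); the resulting list can be passed to `subprocess.run`:

```python
def quantize_cmd(src, dst, embd, ffn, attn, output, n_threads=4,
                 bin_path="./llama.cpp/build/bin/llama-quantize"):
    """Build the llama-quantize argument list used in the command template."""
    cmd = [bin_path, "--token-embedding-type", embd]
    for t in ("ffn_down", "ffn_gate", "ffn_up"):
        cmd += ["--tensor-type", f"{t}={ffn}"]
    for t in ("attn_k", "attn_q", "attn_v", "attn_out"):
        cmd += ["--tensor-type", f"{t}={attn}"]
    # Positional args: source, destination, base type, thread count
    cmd += ["--output-tensor-type", output, src, dst, ffn, str(n_threads)]
    return cmd

print(" ".join(quantize_cmd("orig.gguf", "quant.gguf",
                            "Q2_K", "Q8_0", "Q8_0", "Q8_0", 16)))
```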
## Commands used for Llama 3.2

### Crush token embeddings to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush FFN to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush attention to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush output tensor to Q2_K, otherwise Q8_0 ⚠️

This quant was not included in the testing because Llama 3.2 3B uses tied embeddings and therefore has no separate output tensor. The resulting file is identical to a normal Q8_0.

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
## Raw results for Llama 3.2 3B

```
Number of input texts: 10
Shortest input length in tokens: 55
Longest input length in tokens: 4678
Average input length in tokens: 1605.5
Total number of input tokens: 16055
--------------------------------------------------------------------------------
Evaluating baseline model Llama-3.2-3B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q2_K.gguf:
-- Prompt 0: 1.2261667251586914
-- Prompt 1: 1.1347604990005493
-- Prompt 2: 1.388033390045166
-- Prompt 3: 1.1053369045257568
-- Prompt 4: 1.7510676383972168
-- Prompt 5: 4.586221218109131
-- Prompt 6: 1.3651360273361206
-- Prompt 7: 0.8970077037811279
-- Prompt 8: 0.3409916162490845
-- Prompt 9: 1.2506738901138306
Average MSD: 1.5045396089553833
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.3589555025100708
-- Prompt 1: 0.1420530527830124
-- Prompt 2: 0.3871675133705139
-- Prompt 3: 0.38336610794067383
-- Prompt 4: 0.4630553722381592
-- Prompt 5: 0.3928600549697876
-- Prompt 6: 0.46294596791267395
-- Prompt 7: 0.41983363032341003
-- Prompt 8: 0.0822080597281456
-- Prompt 9: 0.3548887372016907
Average MSD: 0.34473341703414917
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 4.409396648406982
-- Prompt 1: 2.431891679763794
-- Prompt 2: 5.892056941986084
-- Prompt 3: 4.688146591186523
-- Prompt 4: 6.351741313934326
-- Prompt 5: 8.826679229736328
-- Prompt 6: 4.506043434143066
-- Prompt 7: 4.613113880157471
-- Prompt 8: 1.0596126317977905
-- Prompt 9: 4.1558661460876465
Average MSD: 4.693454742431641
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 1.0618470907211304
-- Prompt 1: 1.1212399005889893
-- Prompt 2: 1.3122810125350952
-- Prompt 3: 0.9195016026496887
-- Prompt 4: 1.201547622680664
-- Prompt 5: 5.760651111602783
-- Prompt 6: 1.0914928913116455
-- Prompt 7: 0.9646959900856018
-- Prompt 8: 0.41648873686790466
-- Prompt 9: 1.4317259788513184
Average MSD: 1.5281471014022827
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q8_0.gguf:
-- Prompt 0: 0.0023212190717458725
-- Prompt 1: 0.0014450754970312119
-- Prompt 2: 0.003914575092494488
-- Prompt 3: 0.002514646854251623
-- Prompt 4: 0.003313937224447727
-- Prompt 5: 0.004224818665534258
-- Prompt 6: 0.0026909655425697565
-- Prompt 7: 0.0033839084208011627
-- Prompt 8: 0.0015104531776160002
-- Prompt 9: 0.002354747150093317
Average MSD: 0.0027674345765262842
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Llama-3.2-3B-BF16.gguf:
--------------------------------------------------------------------------------
Llama-3.2-3B-Q2_K.gguf                     -- 1.5045396089553833
Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf  -- 0.34473341703414917
Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf  -- 4.693454742431641
Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf  -- 1.5281471014022827
Llama-3.2-3B-Q8_0.gguf                     -- 0.0027674345765262842
--------------------------------------------------------------------------------
```
## Commands used for Qwen2.5-14B

### Crush token embeddings to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush FFNs to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush attention to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush output tensor to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
## Raw results for Qwen2.5-14B

```
Number of input texts: 10
Shortest input length in tokens: 60
Longest input length in tokens: 4801
Average input length in tokens: 1589.3
Total number of input tokens: 15893
--------------------------------------------------------------------------------
Evaluating baseline model Qwen2.5-14B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q2_K.gguf:
-- Prompt 0: 1.568434476852417
-- Prompt 1: 1.8605916500091553
-- Prompt 2: 1.2912431955337524
-- Prompt 3: 1.3367090225219727
-- Prompt 4: 1.1364308595657349
-- Prompt 5: 2.3384993076324463
-- Prompt 6: 1.2926896810531616
-- Prompt 7: 1.4084643125534058
-- Prompt 8: 0.32443684339523315
-- Prompt 9: 1.3756331205368042
Average MSD: 1.3933132886886597
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.012962134554982185
-- Prompt 1: 0.019185630604624748
-- Prompt 2: 0.05430002510547638
-- Prompt 3: 0.008174948394298553
-- Prompt 4: 0.011592703871428967
-- Prompt 5: 0.012105505913496017
-- Prompt 6: 0.007557644974440336
-- Prompt 7: 0.01957087405025959
-- Prompt 8: 0.013395288027822971
-- Prompt 9: 0.007488884497433901
Average MSD: 0.01663336530327797
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 2.483222246170044
-- Prompt 1: 2.20788836479187
-- Prompt 2: 2.2648935317993164
-- Prompt 3: 2.175588607788086
-- Prompt 4: 1.624481439590454
-- Prompt 5: 4.104475498199463
-- Prompt 6: 2.0161893367767334
-- Prompt 7: 2.0660784244537354
-- Prompt 8: 0.46407243609428406
-- Prompt 9: 2.1939690113067627
Average MSD: 2.160086154937744
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 0.7283403277397156
-- Prompt 1: 1.0912593603134155
-- Prompt 2: 0.9022651314735413
-- Prompt 3: 0.4880850911140442
-- Prompt 4: 0.29713207483291626
-- Prompt 5: 0.6994995474815369
-- Prompt 6: 0.45846545696258545
-- Prompt 7: 0.5286242365837097
-- Prompt 8: 0.2947601079940796
-- Prompt 9: 0.5722559690475464
Average MSD: 0.6060687303543091
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf:
-- Prompt 0: 1.2783535718917847
-- Prompt 1: 0.4481557607650757
-- Prompt 2: 1.1880418062210083
-- Prompt 3: 1.0997036695480347
-- Prompt 4: 0.8093082308769226
-- Prompt 5: 0.6486296057701111
-- Prompt 6: 1.1238276958465576
-- Prompt 7: 1.1459368467330933
-- Prompt 8: 0.23579858243465424
-- Prompt 9: 1.238993525505066
Average MSD: 0.9216748476028442
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q8_0.gguf:
-- Prompt 0: 0.0059487177059054375
-- Prompt 1: 0.004823403432965279
-- Prompt 2: 0.011750683188438416
-- Prompt 3: 0.004459250718355179
-- Prompt 4: 0.004037810489535332
-- Prompt 5: 0.0039064036682248116
-- Prompt 6: 0.004684466868638992
-- Prompt 7: 0.004520604852586985
-- Prompt 8: 0.004727284424006939
-- Prompt 9: 0.004541514907032251
Average MSD: 0.0053400141187012196
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Qwen2.5-14B-BF16.gguf:
--------------------------------------------------------------------------------
Qwen2.5-14B-Q2_K.gguf                     -- 1.3933132886886597
Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf  -- 0.01663336530327797
Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf  -- 2.160086154937744
Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf  -- 0.6060687303543091
Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf  -- 0.9216748476028442
Qwen2.5-14B-Q8_0.gguf                     -- 0.0053400141187012196
--------------------------------------------------------------------------------
```
## TL;DR

Mean-Squared Deviation compared to BF16, averaged over 10 inputs (lower is better):

| | Q2_K | Crush TYPE_EMBD | Crush TYPE_FFN | Crush TYPE_ATTN | Crush TYPE_OUTPUT | Q8_0 |
|---|---|---|---|---|---|---|
| Llama 3.2 3B | 1.504 | 0.344 | 4.693 | 1.528 | N/A | 0.002 |
| Qwen2.5-14B | 1.393 | 0.016 | 2.160 | 0.606 | 0.921 | 0.005 |
| Average | 1.44 | 0.18 | 3.42 | 1.06 | 0.921 | 0.0035 |

In short: aggressive quantization of the FFN tensors causes the greatest deviation from BF16, while aggressive quantization of the token embeddings causes the least. Note that deviations greater than ~0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to some combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0, depending on your situation.
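For reference, the Mean-Squared Deviation used throughout, and the per-prompt averaging behind the "Average MSD" lines, can be sketched as follows (a minimal illustration; the actual test harness used by @ddh0 may differ in details):

```python
def msd(a, b):
    """Mean-squared deviation between two equal-length sequences of logits."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# One MSD per prompt (baseline logits vs. quantized-model logits),
# then a plain average across all prompts:
per_prompt = [msd([1.0, 2.0], [1.0, 2.0]), msd([0.0, 2.0], [1.0, 3.0])]
average = sum(per_prompt) / len(per_prompt)
print(average)  # → 0.5
```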