# Tensor Type Testing

Skip to the bottom of this document for a TL;DR.

For more info, see llama.cpp #12511: Handle user-defined quantization levels for additional tensors by @EAddario.

Testing done by @ddh0 using this branch as of commit 5a304b8, with libllama built for Linux CUDA.
## Quantization naming scheme

```
Model-Name-E{TYPE_EMBD}-F{TYPE_FFN}-A{TYPE_ATTN}-O{TYPE_OUTPUT}.gguf
```

For example, `Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf` means:

- Model is Llama 3.1 8B Instruct
- TYPE_EMBD (token embeddings) are in Q4_K
- TYPE_FFN (MLP / feed-forward tensors) are in Q4_K
- TYPE_ATTN (K, Q, V attention and attention output tensors) are in Q8_0
- TYPE_OUTPUT (output tensor) is in Q8_0
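The naming scheme above can be expressed as a small helper function (hypothetical, for illustration only — not part of the original tooling):

```python
def quant_name(model: str, embd: str, ffn: str, attn: str, output: str) -> str:
    """Build a GGUF filename following the E/F/A/O naming scheme."""
    return f"{model}-E{embd}-F{ffn}-A{attn}-O{output}.gguf"

print(quant_name("Llama-3.1-8B-Instruct", "Q4_K", "Q4_K", "Q8_0", "Q8_0"))
# → Llama-3.1-8B-Instruct-EQ4_K-FQ4_K-AQ8_0-OQ8_0.gguf
```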
## Command template

```shell
TYPE_EMBD=GGML_TYPE
TYPE_FFN=GGML_TYPE
TYPE_ATTN=GGML_TYPE
TYPE_OUTPUT=GGML_TYPE
SRC_GGUF=/my/model/orig.gguf
DST_GGUF=/my/model/quant.gguf
N_THREADS=4

./llama.cpp/build/bin/llama-quantize \
    --token-embedding-type $TYPE_EMBD \
    --tensor-type ffn_down=$TYPE_FFN \
    --tensor-type ffn_gate=$TYPE_FFN \
    --tensor-type ffn_up=$TYPE_FFN \
    --tensor-type attn_k=$TYPE_ATTN \
    --tensor-type attn_q=$TYPE_ATTN \
    --tensor-type attn_v=$TYPE_ATTN \
    --tensor-type attn_out=$TYPE_ATTN \
    --output-tensor-type $TYPE_OUTPUT \
    $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```

Note that the trailing positional `$TYPE_FFN` is the base quantization type passed to `llama-quantize`; the `--token-embedding-type`, `--tensor-type`, and `--output-tensor-type` flags then override specific tensor groups.
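Since the same invocation is repeated below with only the type settings changing, the argument list could also be generated programmatically. A sketch (the `quantize_cmd` helper and its defaults are assumptions, not part of the original harness); the resulting list can be passed to `subprocess.run`:

```python
def quantize_cmd(src, dst, embd, ffn, attn, output, n_threads=4,
                 bin_path="./llama.cpp/build/bin/llama-quantize"):
    """Build the llama-quantize argument list used in the command template."""
    cmd = [bin_path, "--token-embedding-type", embd]
    for t in ("ffn_down", "ffn_gate", "ffn_up"):
        cmd += ["--tensor-type", f"{t}={ffn}"]
    for t in ("attn_k", "attn_q", "attn_v", "attn_out"):
        cmd += ["--tensor-type", f"{t}={attn}"]
    # Positional args: source, destination, base type, thread count
    cmd += ["--output-tensor-type", output, src, dst, ffn, str(n_threads)]
    return cmd

print(" ".join(quantize_cmd("orig.gguf", "quant.gguf",
                            "Q2_K", "Q8_0", "Q8_0", "Q8_0", 16)))
```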
## Commands used for Llama 3.2

### Crush token embeddings to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush FFN to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush attention to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush output tensor to Q2_K, otherwise Q8_0 ⚠️

This quant was not included in the testing because Llama 3.2 3B uses tied embeddings and therefore has no separate output tensor. The resulting file is identical to a normal Q8_0.

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Llama-3.2-3B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Llama-3.2-3B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
## Raw results for Llama 3.2 3B

```
Number of input texts: 10
Shortest input length in tokens: 55
Longest input length in tokens: 4678
Average input length in tokens: 1605.5
Total number of input tokens: 16055
--------------------------------------------------------------------------------
Evaluating baseline model Llama-3.2-3B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q2_K.gguf:
-- Prompt 0: 1.2261667251586914
-- Prompt 1: 1.1347604990005493
-- Prompt 2: 1.388033390045166
-- Prompt 3: 1.1053369045257568
-- Prompt 4: 1.7510676383972168
-- Prompt 5: 4.586221218109131
-- Prompt 6: 1.3651360273361206
-- Prompt 7: 0.8970077037811279
-- Prompt 8: 0.3409916162490845
-- Prompt 9: 1.2506738901138306
Average MSD: 1.5045396089553833
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.3589555025100708
-- Prompt 1: 0.1420530527830124
-- Prompt 2: 0.3871675133705139
-- Prompt 3: 0.38336610794067383
-- Prompt 4: 0.4630553722381592
-- Prompt 5: 0.3928600549697876
-- Prompt 6: 0.46294596791267395
-- Prompt 7: 0.41983363032341003
-- Prompt 8: 0.0822080597281456
-- Prompt 9: 0.3548887372016907
Average MSD: 0.34473341703414917
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 4.409396648406982
-- Prompt 1: 2.431891679763794
-- Prompt 2: 5.892056941986084
-- Prompt 3: 4.688146591186523
-- Prompt 4: 6.351741313934326
-- Prompt 5: 8.826679229736328
-- Prompt 6: 4.506043434143066
-- Prompt 7: 4.613113880157471
-- Prompt 8: 1.0596126317977905
-- Prompt 9: 4.1558661460876465
Average MSD: 4.693454742431641
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 1.0618470907211304
-- Prompt 1: 1.1212399005889893
-- Prompt 2: 1.3122810125350952
-- Prompt 3: 0.9195016026496887
-- Prompt 4: 1.201547622680664
-- Prompt 5: 5.760651111602783
-- Prompt 6: 1.0914928913116455
-- Prompt 7: 0.9646959900856018
-- Prompt 8: 0.41648873686790466
-- Prompt 9: 1.4317259788513184
Average MSD: 1.5281471014022827
--------------------------------------------------------------------------------
Now processing: Llama-3.2-3B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Llama-3.2-3B-BF16.gguf vs. Llama-3.2-3B-Q8_0.gguf:
-- Prompt 0: 0.0023212190717458725
-- Prompt 1: 0.0014450754970312119
-- Prompt 2: 0.003914575092494488
-- Prompt 3: 0.002514646854251623
-- Prompt 4: 0.003313937224447727
-- Prompt 5: 0.004224818665534258
-- Prompt 6: 0.0026909655425697565
-- Prompt 7: 0.0033839084208011627
-- Prompt 8: 0.0015104531776160002
-- Prompt 9: 0.002354747150093317
Average MSD: 0.0027674345765262842
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Llama-3.2-3B-BF16.gguf:
--------------------------------------------------------------------------------
Llama-3.2-3B-Q2_K.gguf                     -- 1.5045396089553833
Llama-3.2-3B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf  -- 0.34473341703414917
Llama-3.2-3B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf  -- 4.693454742431641
Llama-3.2-3B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf  -- 1.5281471014022827
Llama-3.2-3B-Q8_0.gguf                     -- 0.0027674345765262842
--------------------------------------------------------------------------------
```
## Commands used for Qwen2.5-14B

### Crush token embeddings to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q2_K
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush FFNs to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q2_K
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush attention to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q2_K
TYPE_OUTPUT=Q8_0
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
### Crush output tensor to Q2_K, otherwise Q8_0

```shell
TYPE_EMBD=Q8_0
TYPE_FFN=Q8_0
TYPE_ATTN=Q8_0
TYPE_OUTPUT=Q2_K
SRC_GGUF=/opt/workspace/gguf/Qwen2.5-14B-BF16.gguf
DST_GGUF=/opt/workspace/gguf/Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
N_THREADS=16

./llama.cpp/build/bin/llama-quantize --token-embedding-type $TYPE_EMBD --tensor-type ffn_down=$TYPE_FFN --tensor-type ffn_gate=$TYPE_FFN --tensor-type ffn_up=$TYPE_FFN --tensor-type attn_k=$TYPE_ATTN --tensor-type attn_q=$TYPE_ATTN --tensor-type attn_v=$TYPE_ATTN --tensor-type attn_out=$TYPE_ATTN --output-tensor-type $TYPE_OUTPUT $SRC_GGUF $DST_GGUF $TYPE_FFN $N_THREADS
```
## Raw results for Qwen2.5-14B

```
Number of input texts: 10
Shortest input length in tokens: 60
Longest input length in tokens: 4801
Average input length in tokens: 1589.3
Total number of input tokens: 15893
--------------------------------------------------------------------------------
Evaluating baseline model Qwen2.5-14B-BF16.gguf...
Load model...
Evaluate prompts...
Unload model...
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q2_K.gguf:
-- Prompt 0: 1.568434476852417
-- Prompt 1: 1.8605916500091553
-- Prompt 2: 1.2912431955337524
-- Prompt 3: 1.3367090225219727
-- Prompt 4: 1.1364308595657349
-- Prompt 5: 2.3384993076324463
-- Prompt 6: 1.2926896810531616
-- Prompt 7: 1.4084643125534058
-- Prompt 8: 0.32443684339523315
-- Prompt 9: 1.3756331205368042
Average MSD: 1.3933132886886597
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 0.012962134554982185
-- Prompt 1: 0.019185630604624748
-- Prompt 2: 0.05430002510547638
-- Prompt 3: 0.008174948394298553
-- Prompt 4: 0.011592703871428967
-- Prompt 5: 0.012105505913496017
-- Prompt 6: 0.007557644974440336
-- Prompt 7: 0.01957087405025959
-- Prompt 8: 0.013395288027822971
-- Prompt 9: 0.007488884497433901
Average MSD: 0.01663336530327797
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf:
-- Prompt 0: 2.483222246170044
-- Prompt 1: 2.20788836479187
-- Prompt 2: 2.2648935317993164
-- Prompt 3: 2.175588607788086
-- Prompt 4: 1.624481439590454
-- Prompt 5: 4.104475498199463
-- Prompt 6: 2.0161893367767334
-- Prompt 7: 2.0660784244537354
-- Prompt 8: 0.46407243609428406
-- Prompt 9: 2.1939690113067627
Average MSD: 2.160086154937744
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf:
-- Prompt 0: 0.7283403277397156
-- Prompt 1: 1.0912593603134155
-- Prompt 2: 0.9022651314735413
-- Prompt 3: 0.4880850911140442
-- Prompt 4: 0.29713207483291626
-- Prompt 5: 0.6994995474815369
-- Prompt 6: 0.45846545696258545
-- Prompt 7: 0.5286242365837097
-- Prompt 8: 0.2947601079940796
-- Prompt 9: 0.5722559690475464
Average MSD: 0.6060687303543091
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf:
-- Prompt 0: 1.2783535718917847
-- Prompt 1: 0.4481557607650757
-- Prompt 2: 1.1880418062210083
-- Prompt 3: 1.0997036695480347
-- Prompt 4: 0.8093082308769226
-- Prompt 5: 0.6486296057701111
-- Prompt 6: 1.1238276958465576
-- Prompt 7: 1.1459368467330933
-- Prompt 8: 0.23579858243465424
-- Prompt 9: 1.238993525505066
Average MSD: 0.9216748476028442
--------------------------------------------------------------------------------
Now processing: Qwen2.5-14B-Q8_0.gguf
Load model...
Evaluate prompts...
Unload model...
Compute MSD...
Mean-Squared Deviation - Qwen2.5-14B-BF16.gguf vs. Qwen2.5-14B-Q8_0.gguf:
-- Prompt 0: 0.0059487177059054375
-- Prompt 1: 0.004823403432965279
-- Prompt 2: 0.011750683188438416
-- Prompt 3: 0.004459250718355179
-- Prompt 4: 0.004037810489535332
-- Prompt 5: 0.0039064036682248116
-- Prompt 6: 0.004684466868638992
-- Prompt 7: 0.004520604852586985
-- Prompt 8: 0.004727284424006939
-- Prompt 9: 0.004541514907032251
Average MSD: 0.0053400141187012196
--------------------------------------------------------------------------------
Average Mean-Squared Deviation compared to Qwen2.5-14B-BF16.gguf:
--------------------------------------------------------------------------------
Qwen2.5-14B-Q2_K.gguf                     -- 1.3933132886886597
Qwen2.5-14B-EQ2_K-FQ8_0-AQ8_0-OQ8_0.gguf  -- 0.01663336530327797
Qwen2.5-14B-EQ8_0-FQ2_K-AQ8_0-OQ8_0.gguf  -- 2.160086154937744
Qwen2.5-14B-EQ8_0-FQ8_0-AQ2_K-OQ8_0.gguf  -- 0.6060687303543091
Qwen2.5-14B-EQ8_0-FQ8_0-AQ8_0-OQ2_K.gguf  -- 0.9216748476028442
Qwen2.5-14B-Q8_0.gguf                     -- 0.0053400141187012196
--------------------------------------------------------------------------------
```
## TL;DR

Mean-Squared Deviation compared to BF16, averaged over 10 inputs (lower is better):

| | Q2_K | Crush TYPE_EMBD | Crush TYPE_FFN | Crush TYPE_ATTN | Crush TYPE_OUTPUT | Q8_0 |
|---|---|---|---|---|---|---|
| Llama 3.2 3B | 1.504 | 0.344 | 4.693 | 1.528 | N/A | 0.002 |
| Qwen2.5-14B | 1.393 | 0.016 | 2.160 | 0.606 | 0.921 | 0.005 |
| Average | 1.44 | 0.18 | 3.42 | 1.06 | 0.921 | 0.0035 |

In short: aggressive quantization of the FFN tensors causes the greatest deviation from BF16, while aggressive quantization of the token embeddings causes the least. Note that deviations greater than ~0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to some combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0, depending on your situation.
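For reference, the Mean-Squared Deviation used throughout, and the per-prompt averaging behind the "Average MSD" lines, can be sketched as follows (a minimal illustration; the actual test harness used by @ddh0 may differ in details):

```python
def msd(a, b):
    """Mean-squared deviation between two equal-length sequences of logits."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# One MSD per prompt (baseline logits vs. quantized-model logits),
# then a plain average across all prompts:
per_prompt = [msd([1.0, 2.0], [1.0, 2.0]), msd([0.0, 2.0], [1.0, 3.0])]
average = sum(per_prompt) / len(per_prompt)
print(average)  # → 0.5
```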