Llama.cpp hybrid layer quantization of Llama 3.3 70B Instruct by meta-llama

Original model: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct

The hybrid quants employ different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. All layers use K-quant types to avoid the slow processing of IQ quants on CPUs and older GPUs. Three quants are available for the model, with their per-layer recipes given below:

Q3_S_H : Smallest Q3_K-based quant available

  LAYER_TYPES='[
   [0 ,"Q4_K_M"],[1 ,"Q3_K_L"],[2 ,"Q3_K_M"],[3 ,"Q3_K_S"],[4 ,"Q3_K_S"],[5 ,"Q3_K_S"],[6 ,"Q3_K_S"],[7 ,"Q3_K_S"],
   [8 ,"Q3_K_S"],[9 ,"Q3_K_S"],[10,"Q3_K_S"],[11,"Q3_K_S"],[12,"Q3_K_S"],[13,"Q3_K_S"],[14,"Q3_K_S"],[15,"Q3_K_S"],
   [16,"Q3_K_S"],[17,"Q3_K_S"],[18,"Q3_K_S"],[19,"Q3_K_S"],[20,"Q3_K_S"],[21,"Q3_K_S"],[22,"Q3_K_S"],[23,"Q3_K_S"],
   [24,"Q3_K_S"],[25,"Q3_K_S"],[26,"Q3_K_S"],[27,"Q3_K_S"],[28,"Q3_K_S"],[29,"Q3_K_S"],[30,"Q3_K_S"],[31,"Q3_K_S"],
   [32,"Q3_K_S"],[33,"Q3_K_S"],[34,"Q3_K_S"],[35,"Q3_K_S"],[36,"Q3_K_S"],[37,"Q3_K_S"],[38,"Q3_K_S"],[39,"Q3_K_S"],
   [40,"Q3_K_M"],[41,"Q3_K_S"],[42,"Q3_K_M"],[43,"Q3_K_S"],[44,"Q3_K_M"],[45,"Q3_K_S"],[46,"Q3_K_M"],[47,"Q3_K_S"],
   [48,"Q3_K_M"],[49,"Q3_K_S"],[50,"Q3_K_M"],[51,"Q3_K_S"],[52,"Q3_K_M"],[53,"Q3_K_S"],[54,"Q3_K_M"],[55,"Q3_K_S"],
   [56,"Q3_K_M"],[57,"Q3_K_S"],[58,"Q3_K_M"],[59,"Q3_K_S"],[60,"Q3_K_M"],[61,"Q3_K_S"],[62,"Q3_K_M"],[63,"Q3_K_S"],
   [64,"Q3_K_M"],[65,"Q3_K_M"],[66,"Q3_K_M"],[67,"Q3_K_M"],[68,"Q3_K_M"],[69,"Q3_K_M"],[70,"Q3_K_M"],[71,"Q3_K_M"],
   [72,"Q3_K_M"],[73,"Q3_K_M"],[74,"Q3_K_M"],[75,"Q3_K_M"],[76,"Q3_K_M"],[77,"Q3_K_L"],[78,"Q4_K_S"],[79,"Q4_K_M"]
   ]'
  FLAGS="--token-embedding-type Q4_K --output-tensor-type Q5_K --layer-types-high"

Q3_K_H : Slightly larger Q3_K-based quant

   LAYER_TYPES='[
   [0 ,"Q4_K_M"],[1 ,"Q3_K_L"],[2 ,"Q3_K_M"],[3 ,"Q3_K_M"],[4 ,"Q3_K_S"],[5 ,"Q3_K_M"],[6 ,"Q3_K_S"],[7 ,"Q3_K_M"],
   [8 ,"Q3_K_S"],[9 ,"Q3_K_M"],[10,"Q3_K_S"],[11,"Q3_K_M"],[12,"Q3_K_S"],[13,"Q3_K_M"],[14,"Q3_K_S"],[15,"Q3_K_M"],
   [16,"Q3_K_M"],[17,"Q3_K_S"],[18,"Q3_K_M"],[19,"Q3_K_S"],[20,"Q3_K_M"],[21,"Q3_K_S"],[22,"Q3_K_M"],[23,"Q3_K_S"],
   [24,"Q3_K_M"],[25,"Q3_K_S"],[26,"Q3_K_M"],[27,"Q3_K_S"],[28,"Q3_K_M"],[29,"Q3_K_S"],[30,"Q3_K_M"],[31,"Q3_K_S"],
   [32,"Q3_K_M"],[33,"Q3_K_S"],[34,"Q3_K_M"],[35,"Q3_K_S"],[36,"Q3_K_M"],[37,"Q3_K_S"],[38,"Q3_K_M"],[39,"Q3_K_S"],
   [40,"Q3_K_M"],[41,"Q3_K_S"],[42,"Q3_K_M"],[43,"Q3_K_S"],[44,"Q3_K_M"],[45,"Q3_K_S"],[46,"Q3_K_M"],[47,"Q3_K_S"],
   [48,"Q3_K_M"],[49,"Q3_K_S"],[50,"Q3_K_M"],[51,"Q3_K_S"],[52,"Q3_K_M"],[53,"Q3_K_S"],[54,"Q3_K_M"],[55,"Q3_K_S"],
   [56,"Q3_K_M"],[57,"Q3_K_S"],[58,"Q3_K_M"],[59,"Q3_K_S"],[60,"Q3_K_M"],[61,"Q3_K_S"],[62,"Q3_K_M"],[63,"Q3_K_S"],
   [64,"Q3_K_M"],[65,"Q3_K_M"],[66,"Q3_K_M"],[67,"Q3_K_M"],[68,"Q3_K_M"],[69,"Q3_K_M"],[70,"Q3_K_M"],[71,"Q3_K_M"],
   [72,"Q3_K_M"],[73,"Q3_K_M"],[74,"Q3_K_M"],[75,"Q3_K_M"],[76,"Q3_K_L"],[77,"Q3_K_L"],[78,"Q4_K_S"],[79,"Q4_K_M"]
   ]'
  FLAGS="--token-embedding-type Q4_K --output-tensor-type Q5_K --layer-types-high"

Q4_K_H : Largest and best-performing quant

  LAYER_TYPES='[
   [0 ,"Q4_K_M"],[1 ,"Q4_K_M"],[2 ,"Q4_K_S"],[3 ,"Q4_K_S"],[4 ,"Q3_K_M"],[5 ,"Q3_K_L"],[6 ,"Q3_K_M"],[7 ,"Q3_K_L"],
   [8 ,"Q3_K_M"],[9 ,"Q3_K_L"],[10,"Q3_K_M"],[11,"Q3_K_L"],[12,"Q3_K_M"],[13,"Q3_K_L"],[14,"Q3_K_M"],[15,"Q3_K_L"],
   [16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_M"],[22,"Q3_K_L"],[23,"Q3_K_M"],
   [24,"Q3_K_L"],[25,"Q3_K_M"],[26,"Q3_K_L"],[27,"Q3_K_M"],[28,"Q3_K_L"],[29,"Q3_K_M"],[30,"Q3_K_L"],[31,"Q3_K_M"],
   [32,"Q3_K_L"],[33,"Q3_K_M"],[34,"Q3_K_L"],[35,"Q3_K_M"],[36,"Q3_K_L"],[37,"Q3_K_M"],[38,"Q3_K_L"],[39,"Q3_K_M"],
   [40,"Q3_K_L"],[41,"Q3_K_M"],[42,"Q3_K_L"],[43,"Q3_K_M"],[44,"Q3_K_L"],[45,"Q3_K_M"],[46,"Q3_K_L"],[47,"Q3_K_M"],
   [48,"Q3_K_L"],[49,"Q3_K_M"],[50,"Q3_K_L"],[51,"Q3_K_M"],[52,"Q3_K_L"],[53,"Q3_K_M"],[54,"Q3_K_L"],[55,"Q3_K_M"],
   [56,"Q3_K_L"],[57,"Q3_K_M"],[58,"Q3_K_L"],[59,"Q3_K_M"],[60,"Q3_K_L"],[61,"Q3_K_M"],[62,"Q3_K_L"],[63,"Q3_K_M"],
   [64,"Q4_K_S"],[65,"Q3_K_L"],[66,"Q4_K_S"],[67,"Q3_K_L"],[68,"Q4_K_S"],[69,"Q3_K_L"],[70,"Q4_K_S"],[71,"Q3_K_L"],
   [72,"Q4_K_S"],[73,"Q4_K_S"],[74,"Q4_K_M"],[75,"Q4_K_S"],[76,"Q4_K_M"],[77,"Q5_K_S"],[78,"Q5_K_M"],[79,"Q6_K"  ]
   ]'
   FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"
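
For reference, a minimal sketch of how one of these recipes might be applied, assuming a llama-quantize build patched for per-layer tensor types as described in the discussion linked at the end of this card. The LAYER_TYPES variable and the --layer-types-high flag are not part of upstream llama.cpp, and the source file name below is a placeholder:

    # Sketch only (not upstream llama.cpp): apply the Q4_K_H recipe with a
    # llama-quantize build patched for per-layer tensor types. LAYER_TYPES
    # holds the recipe string shown above; how it is consumed depends on
    # the patch. The BF16 source file name is a placeholder.
    export LAYER_TYPES
    FLAGS="--token-embedding-type Q4_K --output-tensor-type Q6_K"
    ./llama-quantize $FLAGS \
        Llama-3.3-70B-Instruct.BF16.gguf \
        Llama-3.3-70B-Instruct.Q4_K_H.gguf \
        Q4_K_M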

All three quants were optimized to maintain knowledge preservation and reasoning performance using a small set of curated test/evaluation prompts. All three quants score 100% on the eval prompts, but the Q3 quants sometimes get a little goofy, giving a wrong answer and then correcting itself with the right one, or adding a non sequitur to the answer, etc. Q4_K_H is rock solid. Note that Q2_K or Q2_K_S could not be used with this model: any Q2 use, even at deep layers, threw the model immediately into either incoherence or large knowledge loss.

Comparison:

   Quant    Size (bytes)   PPL   Comment
   Q3_S_H   32.6e9         4.8   Q3_K dominant with Q4_K embedding
   Q3_K_H   33.4e9         4.8   Q3_K dominant with Q4_K embedding
   Q3_K_M   34.3e9         4.9   Fails parts of eval prompt set
   Q4_K_H   37.5e9         4.5   Best available quant
   IQ4_XS   38.3e9         4.4   Q4_K embedding, Q6_K output
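
The PPL column can be reproduced approximately with the standard llama-perplexity tool; a minimal sketch, assuming a wikitext-style test file and default settings (the exact corpus and context length behind this table are not stated on the card):

    # Hedged sketch: measure perplexity of one of the quants. The test file
    # and context size are assumptions, not necessarily the exact settings
    # used for the table above.
    ./llama-perplexity -m Llama-3.3-70B-Instruct.Q4_K_H.gguf \
        -f wiki.test.raw -c 512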

Usage:

This model may be used together with the fixie-ai ultravox-v0_5-llama-3_3-70b or ultravox-v0_6-llama-3_3-70b projector to process audio (.mp3 and .wav files) and text inputs and generate text outputs. The mmproj files are made available here: https://huggingface.co/steampunque/ultravox-v0_5-llama-3_3-70b-Hybrid-GGUF and https://huggingface.co/steampunque/ultravox-v0_6-llama-3_3-70b-Hybrid-GGUF . More information about running multimodal models can be found in the mtmd README in the tools directory of the llama.cpp source tree: https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/README.md
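
A minimal sketch of an audio-plus-text run through the mtmd CLI, assuming the flag usage documented in the mtmd README; the model, projector, and audio file names are placeholders:

    # Hedged sketch: pair this GGUF with an ultravox mmproj to answer a
    # question about an audio clip. File names are placeholders; check the
    # mtmd README linked above for current flag names.
    ./llama-mtmd-cli -m Llama-3.3-70B-Instruct.Q4_K_H.gguf \
        --mmproj ultravox-v0_6-llama-3_3-70b.mmproj.gguf \
        --audio clip.mp3 \
        -p "Transcribe and summarize this recording."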

Benchmarks:

A partial set of benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm

Download the files below:

   Link                                       Type     Size (bytes)   Notes
   Llama-3.3-70B-Instruct.Q3_S_H.gguf         Q3_S_H   32.6e9         1.7e9 B smaller than Q3_K_M
   Llama-3.3-70B-Instruct.Q3_K_H.gguf         Q3_K_H   33.4e9         0.9e9 B smaller than Q3_K_M
   Llama-3.3-70B-Instruct.Q4_K_H.gguf         Q4_K_H   37.5e9         0.8e9 B smaller than IQ4_XS
   ultravox-v0_5-llama-3_3-70b.mmproj.gguf    mmproj   1.38e9         multimedia projector
   ultravox-v0_6-llama-3_3-70b.mmproj.gguf    mmproj   1.38e9         multimedia projector

A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository:

https://github.com/ggml-org/llama.cpp/discussions/13040
