Custom GGUF quants of meta-llama/Llama-3.1-8B-Instruct, using unsloth/Llama-3.1-8B-Instruct-GGUF's imatrix and quant schemes (except where F32 is substituted for BF16), with the output tensors and embeddings kept at F32 or quantized to Q8_0. Enjoy! (🧠🔥🚀)❤️🦥🕊️

This repo is a WIP, as I have to use Hugging Face's model viewer to meticulously note the quantized/unquantized differences in each tensor/layer of the model in order to match Unsloth's quant scheme.
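If you'd rather audit the per-tensor types locally than through the web viewer, the gguf Python package (from llama.cpp's gguf-py) ships a gguf-dump tool that prints every tensor's shape and quantization type. A minimal sketch, with a placeholder file name:

```sh
pip install gguf
# Print each tensor's shape and quant type, filtered to the attention/FFN weights:
gguf-dump Llama-3.1-8B-Instruct-IQ6_K_XL.gguf | grep -E 'attn|ffn'
```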

As we are combining two different naming schemes into one, here's a little note to ease confusion:

IQ8_0_XL:

  • IQ8_0_XL == IQ8_0 with attention K and Q plus the FFN (Feed-Forward Network) tensors in F32 for blocks 0-1 and 29-31, attention V in F32 for all blocks (0-31), and attn_output in Q6_K for all blocks; the rest of the model stays in Q8_0.
"
--tensor-type \.(0|1|29|30|31)\.attn_k=f32
--tensor-type \.(0|1|29|30|31)\.attn_q=f32
--tensor-type \.([0-9]|1[0-9]|2[0-9]|30|31)\.attn_v=f32
--tensor-type \.([0-9]|1[0-9]|2[0-9]|30|31)\.attn_output=q6_k


--tensor-type \.(0|1|29|30|31)\.ffn_down=f32
--tensor-type \.(0|1|29|30|31)\.ffn_gate=f32
--tensor-type \.(0|1|29|30|31)\.ffn_up=f32


--tensor-type \.([2-9]|1[0-9]|2[0-8])\.attn_k=q8_0
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.attn_q=q8_0


--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_down=q8_0
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_gate=q8_0
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_up=q8_0
"

IQ6_K_XL:

  • IQ6_K_XL == IQ8_0 with attn_output in Q6_K for all blocks, and the middle blocks' FFN tensors in Q6_K as well (ffn_down for blocks 7-28; ffn_gate and ffn_up for blocks 2-28); the rest of the model stays in Q8_0.
"
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_k=q8_0
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_q=q8_0
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_v=q8_0
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_output=q6_k

--tensor-type \.([0-6]|29|3[0-1])\.ffn_down=q8_0
--tensor-type \.([0-1]|29|3[0-1])\.ffn_gate=q8_0
--tensor-type \.([0-1]|29|3[0-1])\.ffn_up=q8_0

--tensor-type \.(7|1[0-9]|2[0-8])\.ffn_down=q6_k
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_gate=q6_k
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_up=q6_k
"