Custom GGUF quants of meta-llama/Llama-3.1-8B-Instruct, using unsloth/Llama-3.1-8B-Instruct-GGUF's imatrix and quant schemes (except where F32 is substituted for BF16), with the output tensors and embeddings kept at F32 or quantized to Q8_0. Enjoy! (🧠🔥🚀)❤️🦥🕊️

This repo is a WIP, as I have to use Hugging Face's model viewer to meticulously note the quantized/unquantized differences in each tensor/layer of the model in order to match Unsloth's quant scheme.
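If you'd rather audit the per-tensor types locally than through the web viewer, the gguf Python package (from llama.cpp's gguf-py) ships a gguf-dump tool that prints every tensor's shape and quantization type. A minimal sketch, with a placeholder file name:

```sh
pip install gguf
# Print each tensor's shape and quant type, filtered to the attention/FFN weights:
gguf-dump Llama-3.1-8B-Instruct-IQ6_K_XL.gguf | grep -E 'attn|ffn'
```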

As we are combining two different naming schemes into one, here's a little note to ease confusion:

IQ8_0_XL:

  • IQ8_0_XL == IQ8_0 with attention K and Q plus the FFN (Feed-Forward Network) tensors in F32 for blocks 0-1 and 29-31, attention V in F32 for all blocks (0-31), and attn_output in Q6_K for all blocks; the rest of the model stays in Q8_0.
"
--tensor-type \.(0|1|29|30|31)\.attn_k=f32
--tensor-type \.(0|1|29|30|31)\.attn_q=f32
--tensor-type \.([0-9]|1[0-9]|2[0-9]|30|31)\.attn_v=f32
--tensor-type \.([0-9]|1[0-9]|2[0-9]|30|31)\.attn_output=q6_k


--tensor-type \.(0|1|29|30|31)\.ffn_down=f32
--tensor-type \.(0|1|29|30|31)\.ffn_gate=f32
--tensor-type \.(0|1|29|30|31)\.ffn_up=f32


--tensor-type \.([2-9]|1[0-9]|2[0-8])\.attn_k=q8_0
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.attn_q=q8_0


--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_down=q8_0
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_gate=q8_0
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_up=q8_0
"

IQ6_K_XL:

  • IQ6_K_XL == IQ8_0 with attn_output in Q6_K for all blocks, and the middle blocks' FFN tensors in Q6_K as well (ffn_down for blocks 7-28; ffn_gate and ffn_up for blocks 2-28); the rest of the model stays in Q8_0.
"
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_k=q8_0
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_q=q8_0
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_v=q8_0
--tensor-type \.([0-9]|1[0-9]|2[0-9]|3[0-1])\.attn_output=q6_k

--tensor-type \.([0-6]|29|3[0-1])\.ffn_down=q8_0
--tensor-type \.([0-1]|29|3[0-1])\.ffn_gate=q8_0
--tensor-type \.([0-1]|29|3[0-1])\.ffn_up=q8_0

--tensor-type \.(7|1[0-9]|2[0-8])\.ffn_down=q6_k
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_gate=q6_k
--tensor-type \.([2-9]|1[0-9]|2[0-8])\.ffn_up=q6_k
"