Optimal Quants from a performance perspective on Apple Silicon

#5
by qikchen - opened

@danielhanchen I imagine this applies to all quants by you. What quants are most efficient to run on Apple Silicon devices?
This article says:

To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add IQ4_NL, Q5_1, Q5_0, Q4_1, and Q4_0 formats.

Among the GGUFs, I see only two from this list, i.e. IQ4_NL and Q4_0. For the most efficient inference with llama.cpp, should I choose one of these two?
And, if I am not mistaken, this means I won't get the benefit of Unsloth dynamic quants when I use these?

I have an Apple M4 Max with 64GB of RAM and can run all quants comfortably, but I am looking to optimize for efficiency, since this model will be running in the background indefinitely.
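For anyone weighing these formats against each other, a rough memory estimate is easy to sketch. The bits-per-weight figures below come from the block layouts in llama.cpp's ggml (e.g. Q4_0 packs 32 weights into 18 bytes); the 8B parameter count is just an illustrative assumption, and this ignores the KV cache and any tensors kept at higher precision:

```python
# Rough weight-memory estimate per llama.cpp quant format for a
# hypothetical 8B-parameter model. Bits-per-weight are derived from
# ggml block sizes (bytes per 32-weight block * 8 / 32).
BITS_PER_WEIGHT = {
    "Q4_0":   4.5,   # 18 bytes per 32-weight block
    "IQ4_NL": 4.5,   # same size as Q4_0, non-linear 4-bit grid
    "Q4_1":   5.0,   # 20 bytes per 32-weight block
    "Q5_0":   5.5,   # 22 bytes per 32-weight block
    "Q5_1":   6.0,   # 24 bytes per 32-weight block
}

def weight_gib(n_params: float, quant: str) -> float:
    """Approximate GiB of RAM for the quantized weights alone."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

for q in BITS_PER_WEIGHT:
    print(f"{q:7s} ~{weight_gib(8e9, q):.2f} GiB")
```

So on an 8B model the gap between Q4_0 and Q5_1 is on the order of 1.4 GiB; with 64GB any of these fits easily, and the choice comes down to speed and quality rather than footprint.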

Unsloth AI org

I think Q4_1 is best for Apple Silicon.

All quants, including the non-UD ones, use our calibration dataset! So there are still some benefits!

@danielhanchen Thanks! I see the performance improvement.

qikchen changed discussion status to closed
