Optimal Quants from a performance perspective on Apple Silicon
@danielhanchen
I imagine this applies to all quants by you. What quants are most efficient to run on Apple Silicon devices?
This article says:
To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add IQ4_NL, Q5_1, Q5_0, Q4_1, and Q4_0 formats.
I see only two GGUFs from this list, i.e. IQ4_NL and Q4_0. For the most efficient inference with llama.cpp, should I choose one of these two?
And, if I am not mistaken, this means that I won't get the benefit of Unsloth dynamic quants when I use these?
I have an Apple M4 Max with 64 GB of RAM and can run all quants comfortably, but I am looking to optimize for efficiency, since this model will be running in the background indefinitely.
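To compare the memory cost of these formats, a rough back-of-the-envelope sketch can help: each quant format has an approximate bits-per-weight (bpw) figure, so file size scales roughly linearly with parameter count. The helper name and the exact bpw values below are approximations for illustration, not authoritative numbers from llama.cpp:

```python
# Approximate bits-per-weight for common llama.cpp quant formats.
# These figures are rough illustrative values, not exact llama.cpp numbers.
BPW = {
    "Q4_0": 4.5,
    "Q4_1": 5.0,
    "Q5_0": 5.5,
    "Q5_1": 6.0,
    "IQ4_NL": 4.5,
}

def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Rough GGUF file size in GB for a model with the given parameter count."""
    total_bits = n_params_billion * 1e9 * BPW[quant]
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

if __name__ == "__main__":
    for q in BPW:
        print(f"{q:7s} ~{approx_size_gb(7, q):.1f} GB for a 7B model")
```

Under these assumed bpw values, a 7B model at Q4_0 lands around 3.9 GB versus about 4.4 GB at Q4_1, so on a 64 GB machine the difference between the candidate formats is modest.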
I think Q4_1 is best for Apple Silicon.
All quants, including non-UD ones, use our calibration dataset! So there are still some benefits!