Optimal Quants from a performance perspective on Apple Silicon

#5
by qikchen - opened

@danielhanchen I imagine this applies to all quants by you. What quants are most efficient to run on Apple Silicon devices?
This article says:

To maximize efficiency, especially on Apple Silicon and ARM devices, we now also add IQ4_NL, Q5_1, Q5_0, Q4_1, and Q4_0 formats.

Among the GGUFs, I see only two from this list, i.e. IQ4_NL and Q4_0. For the most efficient inference with llama.cpp, should I choose one of these two?
And, if I am not mistaken, this means I won't get the benefit of Unsloth dynamic quants when I use these?

I have an Apple M4 Max with 64GB of RAM and can run all quants comfortably, but I am looking to optimize for efficiency, since this model will be running in the background indefinitely.
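For anyone weighing these formats against each other, a rough memory estimate is easy to sketch. The bits-per-weight figures below come from the block layouts in llama.cpp's ggml (e.g. Q4_0 packs 32 weights into 18 bytes); the 8B parameter count is just an illustrative assumption, and this ignores the KV cache and any tensors kept at higher precision:

```python
# Rough weight-memory estimate per llama.cpp quant format for a
# hypothetical 8B-parameter model. Bits-per-weight are derived from
# ggml block sizes (bytes per 32-weight block * 8 / 32).
BITS_PER_WEIGHT = {
    "Q4_0":   4.5,   # 18 bytes per 32-weight block
    "IQ4_NL": 4.5,   # same size as Q4_0, non-linear 4-bit grid
    "Q4_1":   5.0,   # 20 bytes per 32-weight block
    "Q5_0":   5.5,   # 22 bytes per 32-weight block
    "Q5_1":   6.0,   # 24 bytes per 32-weight block
}

def weight_gib(n_params: float, quant: str) -> float:
    """Approximate GiB of RAM for the quantized weights alone."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

for q in BITS_PER_WEIGHT:
    print(f"{q:7s} ~{weight_gib(8e9, q):.2f} GiB")
```

So on an 8B model the gap between Q4_0 and Q5_1 is on the order of 1.4 GiB; with 64GB any of these fits easily, and the choice comes down to speed and quality rather than footprint.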

Unsloth AI org

I think Q4_1 is best for Apple Silicon.

All quants, including the non-UD ones, use our calibration dataset! So there are still some benefits!

@danielhanchen Thanks! I see the performance improvement.

qikchen changed discussion status to closed
