EXL3 might be SOTA?

#1
by Tibbnak - opened

This is a heck of a quant format you've got, and it's fairly well implemented and usable now, too. There might genuinely not be another public quant method out there right now that could come close to beating it (at least, not one that actually exists outside of words on a document).

Looking at the KL divergence graphs, you've effectively made a quantization format where 6bpw has become virtually indistinguishable from 8-bit, on top of the inference engine itself already being more efficient.

Even your 4bpw (for this model at least) is around a Q5_K_S equivalent, and the 3.5bpw is on par with a Q4_K_M model despite being noticeably smaller than an IQ4_XS.

I think it might be SOTA, yes. It's not that surprising since it's based on QTIP, and QTIP is brilliant.

Now, there is a GitHub repo by the original authors, and a couple of QTIP models on HF, but I can't actually get any of that to work. I gave up after some days of trying, but I was always going to roll my own implementation anyway, so whatever. I think, though, that it's safe to assume the few QTIP models on HF would be at least as good as their EXL3 counterparts, since the underlying method is roughly the same. The main idea behind EXL3 was to make it all approachable and easy, because working with the reference QTIP code is very hard (respectfully).

There's some more information here and some early benchmarks. I'll update those soon, I think, since I mostly want to focus on KL divergence going forward. Using perplexity to compare quants this way is somewhat flawed.
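
To illustrate why perplexity can hide differences that KL divergence picks up, here is a rough toy sketch in plain Python. It is not the evaluation code used for the graphs; the function names (`log_softmax`, `kl_divergence`, `mean_kl`, `perplexity`) and the three-token "vocabulary" are made up for the example. Perplexity only looks at the probability given to the one correct token, while KL divergence compares the whole output distribution of the quantized model against the unquantized reference.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, quant_logits):
    # KL(P_ref || P_quant) for a single token position, in nats.
    lp = log_softmax(ref_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_kl(ref, quant):
    # Average KL divergence over all token positions.
    return sum(kl_divergence(r, q) for r, q in zip(ref, quant)) / len(ref)

def perplexity(logits_per_pos, target_ids):
    # exp of the mean negative log-likelihood of the correct tokens.
    nll = -sum(log_softmax(l)[t] for l, t in zip(logits_per_pos, target_ids))
    return math.exp(nll / len(target_ids))

# Hypothetical data: one position, vocabulary of 3 tokens, correct token is 0.
ref_logits = [[2.0, 1.0, 0.0]]    # stand-in for the unquantized model
quant_logits = [[2.0, 0.0, 1.0]]  # stand-in for a quantized model
targets = [0]

# Both models give the correct token the same probability, so perplexity
# matches, but the rest of the distribution has shifted, so KL is non-zero.
print("ppl ref:  ", perplexity(ref_logits, targets))
print("ppl quant:", perplexity(quant_logits, targets))
print("mean KL:  ", mean_kl(ref_logits, quant_logits))
```

In this contrived case the two perplexities agree while the mean KL divergence is clearly above zero, which is the kind of drift a perplexity-only comparison would miss.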
