Q3_K_L

#3
by notafraud - opened

Hi! Thank you for your attention to quantization! Have you tested or considered a Q3_K_L-based custom quantization? In multiple other models, _L quants have shown improved quality compared to the _S and _M ones. It would be awesome to get just a bit more quality, especially since a 21b model in Q3_K_L still fits into 12gb of VRAM (with room for context).
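As a rough sanity check on the "fits into 12gb" claim, one can estimate the weight footprint from parameter count and bits per weight. The ~4.3 bits/weight figure for Q3_K_L below is an approximation taken from typical llama.cpp model sizes, not an exact value, so treat the result as a ballpark:

```python
# Back-of-the-envelope VRAM estimate for the quantized weights alone;
# the KV cache and activations come on top of this.
PARAMS = 21e9      # 21b parameters
BPW_Q3_K_L = 4.3   # approximate effective bits per weight for Q3_K_L

weights_gb = PARAMS * BPW_Q3_K_L / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~11.3 GB of weights
```

So the weights alone land near the 12gb limit; how much context fits depends on the exact bits-per-weight and the KV cache size.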

Reka AI org
•
edited Jul 15

Hi! Thank you for your interest. We targeted Q3_K_S for this release to showcase the much reduced error at low bitrates, but adding new quantization types should be easy. Q3_K_L is the same as Q3_K_S except that it keeps some tensors at higher precision (e.g. the attention value matrix and a few others). One would simply need to add the corresponding quantization schedule to RekaQuant (the example schedule for Q3_K_S: https://github.com/reka-ai/rekaquant/blob/4800b7fbb34b79b755ee2a8a7bb015ff16a56b71/src/train.py#L336 ) and run it on any model. We don't plan to release any more quants ourselves in the short term, but if there is interest from the community we could consider it :)
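To illustrate the idea, here is a hypothetical sketch of a per-tensor schedule in the spirit of Q3_K_L: everything in Q3_K except a few sensitive tensors kept at higher precision. The tensor-name patterns, schedule format, and precision choices are illustrative assumptions only, not the actual RekaQuant API (the real Q3_K_S schedule lives in `src/train.py` at the link above):

```python
# Hypothetical Q3_K_L-style schedule: a list of (pattern, quant type) pairs,
# where the first matching pattern wins. Names and format are illustrative,
# NOT the actual RekaQuant schedule format.
import fnmatch

SCHEDULE = [
    ("*.attn_v.weight", "Q5_K"),       # attention value matrix at higher precision
    ("*.attn_output.weight", "Q5_K"),  # attention output projection
    ("*.ffn_down.weight", "Q5_K"),     # FFN down projection
    ("*", "Q3_K"),                     # default for all remaining weight tensors
]

def quant_type_for(tensor_name: str) -> str:
    """Return the quant type for a tensor: first matching pattern wins."""
    for pattern, qtype in SCHEDULE:
        if fnmatch.fnmatch(tensor_name, pattern):
            return qtype
    return "Q3_K"

print(quant_type_for("blk.0.attn_v.weight"))    # Q5_K
print(quant_type_for("blk.0.ffn_gate.weight"))  # Q3_K
```

The "first match wins" lookup keeps the default rule (`"*"`) last, so adding a new override is just a matter of prepending a pattern.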

How about Q4_K_M or Q4_K_L quant models? They would fit nicely into a 16gb GPU ;). Thanks.
