The qx64-mlx quantization

#9
by nightmedia - opened

This is a formula I use for Qwens, mostly MoEs, but it has turned out to work on dense models as well. I did not know whether it would work for Apertus, and it could well give mixed results (I am running integration tests now). The formula uses mixed-precision layers: it is basically a 4-bit quant with 6-bit paths for attention and context (Deckard).
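
For readers curious what such a recipe looks like in practice, here is a minimal sketch using mlx-lm's conversion API. The `quant_predicate` interface, the layer-name patterns, and the upstream repo name are my assumptions about a recent mlx-lm release and the Apertus module naming, not the author's actual script.

```python
# Hypothetical sketch: 4-bit base quant with 6-bit paths for attention and
# embedding/output layers. Layer-name hints are assumptions; adjust them to
# the model's actual module paths.
from mlx_lm import convert

HIGH_PRECISION_HINTS = ("q_proj", "k_proj", "v_proj", "o_proj", "embed_tokens", "lm_head")

def qx64_predicate(path, module, config):
    # Return per-layer quantization settings: 6-bit for the attention and
    # embedding/output paths, 4-bit for everything else.
    if any(hint in path for hint in HIGH_PRECISION_HINTS):
        return {"bits": 6, "group_size": 64}
    return {"bits": 4, "group_size": 64}

convert(
    "swiss-ai/Apertus-70B-Instruct-2509",  # example upstream repo
    mlx_path="Apertus-70B-Instruct-2509-qx64-mlx",
    quantize=True,
    quant_predicate=qx64_predicate,
)
```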

I shared it because I found it apt at coding in Janet, and once it gets going, it develops an appetite for coding.

Interesting model

Swiss AI Initiative org

Thanks!
There are already a few other quantized MLX versions here; I tried the 4-bit and 8-bit and they work quite well: https://huggingface.co/models?search=apertus%20mlx

Looking forward to trying yours as well. Did you do yours for the 70B or the 8B, and is there a link?

To be honest, I did not know whether to upload the 8B after trying it for the first time. Interacting with it is quite different from a similar model from Qwen. After experimenting with the settings on the 8B model, I found that a top-k around 20 makes it a bit more fluent. The default settings in LM Studio are not helping.
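
For reference, a top-k around 20 can also be set when generating with mlx-lm directly; in LM Studio the equivalent knob is the Top K sampling field. The `make_sampler` arguments and the plain-prompt call below are assumptions about a recent mlx-lm release, shown only as an illustration.

```python
# Hypothetical sketch: generate with top_k ~ 20, which seemed to make the
# 8B quant more fluent than the default sampler settings.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("nightmedia/Apertus-8B-Instruct-2509-qx86-mlx")

# temp value is just an example; chat templating is omitted for brevity
sampler = make_sampler(temp=0.7, top_k=20)
print(generate(model, tokenizer,
               prompt="Write a small Janet function that reverses a list.",
               sampler=sampler, max_tokens=256))
```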

I am now uploading nightmedia/Apertus-8B-Instruct-2509-qx86-mlx.

Similar formula, more bits. Small models lose a lot more to quantization. I even tried mxfp4: it works okay on the 70B, not so much on the 8B.
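
In terms of the hypothetical predicate sketched earlier, the qx86 variant would simply raise both tiers (8-bit paths over a 6-bit base); again, just an illustration rather than the actual recipe.

```python
# Hypothetical qx86 variant of the earlier sketch: 6-bit base, 8-bit paths.
def qx86_predicate(path, module, config):
    if any(hint in path for hint in HIGH_PRECISION_HINTS):
        return {"bits": 8, "group_size": 64}
    return {"bits": 6, "group_size": 64}
```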

In some cases this approach sharpens the quality of the output. I am really curious how it will do on an 8B, as I have only done a couple of small tests. I ran into a MoE where the qx86-hi quant outperformed the parent model at BF16, and even the qx64-hi was getting pretty close.
