The qx64-mlx quantization

#9
by nightmedia - opened

This is a formula I use for Qwens, mostly MoEs, but it has turned out to work on dense models as well. I did not know whether it would work for Apertus, and it could well give mixed results (I am running integration tests now). The formula uses mixed-precision layers: it is basically a 4-bit quant with 6-bit paths for attention and context (Deckard).
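
For readers curious what such a recipe looks like in practice, here is a minimal sketch using mlx-lm's conversion API. The `quant_predicate` interface, the layer-name patterns, and the upstream repo name are my assumptions about a recent mlx-lm release and the Apertus module naming, not the author's actual script.

```python
# Hypothetical sketch: 4-bit base quant with 6-bit paths for attention and
# embedding/output layers. Layer-name hints are assumptions; adjust them to
# the model's actual module paths.
from mlx_lm import convert

HIGH_PRECISION_HINTS = ("q_proj", "k_proj", "v_proj", "o_proj", "embed_tokens", "lm_head")

def qx64_predicate(path, module, config):
    # Return per-layer quantization settings: 6-bit for the attention and
    # embedding/output paths, 4-bit for everything else.
    if any(hint in path for hint in HIGH_PRECISION_HINTS):
        return {"bits": 6, "group_size": 64}
    return {"bits": 4, "group_size": 64}

convert(
    "swiss-ai/Apertus-70B-Instruct-2509",  # example upstream repo
    mlx_path="Apertus-70B-Instruct-2509-qx64-mlx",
    quantize=True,
    quant_predicate=qx64_predicate,
)
```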

I shared it because I found it apt at coding in Janet, and once it gets going, it develops an appetite for coding.

Interesting model

Swiss AI Initiative org

Thanks!
There are already a few other quantized MLX versions here; I tried the 4-bit and 8-bit and they work quite well: https://huggingface.co/models?search=apertus%20mlx

Looking forward to trying yours as well. Did you do yours for the 70B or the 8B, and is there a link?

To be honest, I did not know whether to upload the 8B after trying it for the first time. Interacting with it is quite different from a similar model from Qwen. After experimenting with the settings on the 8B model, I found that a top-k around 20 makes it a bit more fluent. The default settings in LM Studio are not helping.
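
For reference, a top-k around 20 can also be set when generating with mlx-lm directly; in LM Studio the equivalent knob is the Top K sampling field. The `make_sampler` arguments and the plain-prompt call below are assumptions about a recent mlx-lm release, shown only as an illustration.

```python
# Hypothetical sketch: generate with top_k ~ 20, which seemed to make the
# 8B quant more fluent than the default sampler settings.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("nightmedia/Apertus-8B-Instruct-2509-qx86-mlx")

# temp value is just an example; chat templating is omitted for brevity
sampler = make_sampler(temp=0.7, top_k=20)
print(generate(model, tokenizer,
               prompt="Write a small Janet function that reverses a list.",
               sampler=sampler, max_tokens=256))
```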

I am now uploading nightmedia/Apertus-8B-Instruct-2509-qx86-mlx.

Similar formula, more bits. Small models lose a lot more to quantization. I even tried mxfp4: it works okay on the 70B, not so much on the 8B.
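
In terms of the hypothetical predicate sketched earlier, the qx86 variant would simply raise both tiers (8-bit paths over a 6-bit base); again, just an illustration rather than the actual recipe.

```python
# Hypothetical qx86 variant of the earlier sketch: 6-bit base, 8-bit paths.
def qx86_predicate(path, module, config):
    if any(hint in path for hint in HIGH_PRECISION_HINTS):
        return {"bits": 8, "group_size": 64}
    return {"bits": 6, "group_size": 64}
```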

In some cases this approach sharpens the quality of the output. I am really curious how it will do on an 8B, as I have only done a couple of small tests. I ran into a MoE where the qx86-hi quant outperformed the parent model at BF16, and even the qx64-hi was getting pretty close.
