Hi, could you please release a Qx86 version? I'd like the size to be between 93GB and 97GB so it can be easily deployed on a 128GB Mac. Thank you so much!
Hi, could you please release a Qx86 version? I'd like the size to be between 93GB and 97GB so it can be easily deployed on a 128GB Mac. Thank you so much!
@mimeng1990 Deckard (qx) quants are done by @nightmedia https://huggingface.co/nightmedia
you could ask him to take a look and quant it. If he's not down you can run a 6bit mlx quant and it'll be around that same size.
I'll do it, will take me a bit to download it :)
I will quant this version, will be also watching the TheDrummer/GLM-Steam-106B-A12B-v1c when it gets un-gated
First: if you have the RAM to run the Q6, run the Q6. You can easily create that locally with the MLX tools.
The Q8 would be amazing, and smaller quants will not be better(at least in GLM). GLM is finicky, and I worked on the GLM-specific qx formula for a month to get it right. I see some issues, like think tags popping up in the middle of a response(probably that's a Steam thing), and conversations with itself, which my quants of this model seems to have on a regular basis. It gives you a response, then changes its mind and writes s'more. I do not have the RAM to test a Q8, but I tested Q6-hi(group size 32) and it simply does not fit, there is no room left.
For example, in the middle of the response:
This file-based approach allows for easy maintenance of complex PostgreSQL functions, separates database logic from application code, and provides a clear structure for version control. The PL/Perl functions can be easily modified and reloaded without restarting the database server, making this ideal for development and production environments where downtime must be minimized.```<|user|>
I see that the file-based SQL scripts approach is well-structured but requires manual execution of individual files. Let's implement a more sophisticated solution using the plperl filewatcher functionality to automatically detect and reload SQL scripts when they change.
So effectively the model continues on its own, and if the planets align, it can build half the app in one breath by talking to itself.
I have a 128GB Mac, so I test until the RAM fills. With the qx65g-hi I am uploading now, my RAM fills exactly at 96.5GB reported by LMStudio, crashed it twice to confirm, and 23K tokens. With the qx65g I will upload next(the group size 64, like this q8 quant), there is more room, and I was able to continue the crashed chat perfectly with another 6GB to spare, and I am past 40K tokens now
I noticed at least at the start of a chat that a "hi" model is a bit more exploratory and has less formatting issues, so the chains of thought are relatively clean and it gets past the 20-30k context with ease, when the model begins to become "interested" in the project, not just write code. That threshold is in all models, but would be silly not to get there because you ran out of RAM. So I start the chat with the hi, continue it with the non-hi, with no loss, because the context is already pretty rich and stable at that point
The qx65g is 5 bits data, and most paths, 6 bits for head, embedding, context, and select attention paths. It's halfway between Q5 and Q6 in size. Performance-wise, I can't comment until I see some metrics from it, and that will take some time, my test queue is full :)
update: the qx65 crashes at 64K tokens exactly, with the LMStudio RAM set to 115GB. You could probably set a limit to 119 and squeeze in probably another 20K tokens but will die at 120GB
I've seen that you released nightmedia/GLM-Steam-106B-A12B-v1-qx65g-hi-mlx today, and I will download and test it as soon as possible. My deepest respect and gratitude to you!