Awesome!
Can you share how you ran the repo's script? What parameters? I assume it’s a 25% cull?
What hardware did it require?
Is there any other data you can share, like the data used to select the experts to cull?
I assume it needs hardware to run the full/FP8 PyTorch model. I'd love a GGUF of any of these REAPs at least: higher quants, faster speeds. EXL3 in 96 GB is looking good too at the smaller size.
The method itself, I think, uses inference to figure out which parameters to cut, so it isn't blind.
Yeah.
I’m asking because even a 10-15% prune would be perfect for me (depending on how lossy it actually is), and if the “list” of experts to cull is already made, perhaps other sizes could be made too.
I’m also interested in the reference dataset. Was it coding, which seems to be the suggested default in the scripts? Perhaps optimal culls for different tasks are different.
Yeah, I'd hate for it to be coding when I'm trying to do conversations. Without quants I have no way of trying them out; it's too much to download. They could be perfect or completely awful.
I made some really scuffed modifications to the code via a fork: https://github.com/AesSedai/reap
Then I rented some cloud compute to perform the REAP; here are the approximate steps I recorded. The whole NVMe formatting bit can be ignored, it was just down to how the instance was set up with attached, unformatted storage:
git clone https://github.com/AesSedai/reap.git
cd reap
# python.h header files needed for zstandard, which is needed for helm (https://github.com/stanford-crfm/helm.git)
sudo apt-get update
sudo apt-get install python3.12-dev
# activate the repo's virtualenv (assumes .venv has already been created and the dependencies installed)
source .venv/bin/activate
# format the attached NVMe storage, 3.5TB per disk so one disk is fine
lsblk
sudo parted /dev/nvme1n1 mklabel gpt
sudo parted -a opt /dev/nvme1n1 mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/nvme1n1p1
sudo mkdir /mnt/slush
sudo mount /dev/nvme1n1p1 /mnt/slush
lsblk -f
df -h /mnt/slush
# keep the pruned model output on the big disk via a symlink into the repo's artifacts dir
sudo mkdir -p /mnt/slush/pruned_models
sudo chown -R user:user /mnt/slush/
ln -s /mnt/slush/pruned_models /home/user/reap/artifacts/GLM-4.6/evol-codealpaca-v1/pruned_models
# patch the GLM model files, then run the pruning + eval pipeline
python ./scripts/patch_glm.py
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false
The lm_eval worked, but evalplus ran into issues cloning the dataset down from github so I didn't have time to finish troubleshooting the issue there.
Is there a way to get an AWQ version of this? That would be awesome, but I can't find one.
Awesome! Thanks for the steps.
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false
The `lm_eval` worked, but `evalplus` ran into issues cloning the dataset down from github so I didn't have time to finish troubleshooting the issue there.
What do you mean by this? Do you change 'theblackcat102/evol-codealpaca-v1' to some lm_eval benchmark, and it will use that for the pruning?
`theblackcat102/evol-codealpaca-v1` is the dataset used to produce the list of experts to prune, if I understand the REAP code correctly. It's the default dataset arg given in their repo's README, so I just went with it.
lm_eval uses its own defaults; the five `true true true false false` flags at the end are what tell the script which benchmarks to run (see the annotated sketch below):
- lm_eval
- evalplus
- livecodebench
- math
- wildbench
It ran the lm_eval benchmark successfully, but failed on the evalplus benchmark: the host couldn't set up the evalplus dataset due to an HTTP error that prevented it from downloading from GitHub. Not sure if it was some odd, possibly region- or IP-specific issue, since I was able to git clone from GitHub to set up REAP on the same host, but it kept hitting an HTTP connection reset and couldn't download.
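For reference, here's my reading of what those positional args map to. The names below are my own guesses from this thread (e.g. I'm assuming 42 is a seed), not anything documented in pruning-cli.sh, so double-check against the script:

```bash
# My reading of the positional args to experiments/pruning-cli.sh (names are guesses, verify in the script)
GPUS=0,1,2,3,4,5,6,7                        # CUDA device ids to use
MODEL=zai-org/GLM-4.6                       # HF model to prune
METHOD=reap                                 # pruning method
SEED=42                                     # presumably a random seed
PRUNE_RATIO=0.25                            # fraction of experts to cull (25%)
DATASET=theblackcat102/evol-codealpaca-v1   # calibration data used to rank/select experts
# final five booleans = run lm_eval / evalplus / livecodebench / math / wildbench
bash experiments/pruning-cli.sh "$GPUS" "$MODEL" "$METHOD" "$SEED" "$PRUNE_RATIO" "$DATASET" true true true false false
```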
Hah, sounds like one of those “works on our lab machine at this point in time” kind of repo things.
Thanks.
I want to try this with tiny MoEs and see if other datasets work (and prune a notably different set of experts), then rent something to try on 4.6.
Judging from the way the git submodules in the original repo are configured (pointing at a private user fork of the upstream dependencies), that's probably true. One of the first changes I had to make was pointing them back at the public upstream repos for lm_eval, evalplus, helm, etc.
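Roughly something like this, though the submodule paths below are hypothetical; check `.gitmodules` in the repo for the real ones:

```bash
# Hypothetical example of repointing submodules to the public upstreams.
# The third_party/* paths are placeholders; the actual submodule paths in the reap repo may differ.
git submodule set-url third_party/lm-evaluation-harness https://github.com/EleutherAI/lm-evaluation-harness.git
git submodule set-url third_party/evalplus https://github.com/evalplus/evalplus.git
git submodule set-url third_party/helm https://github.com/stanford-crfm/helm.git
git submodule sync --recursive
git submodule update --init --recursive
```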
I am stumbling my way around preserving MTP tensors during REAP here: ddh0/reap:mtp. Aes Sedai now has push access to this fork as well so hopefully we can get something working soon. 🤞
then rent something to try on 4.6
I haven't looked at the code yet, but I wasn't able to create an exllamav3 quant of this one because the number of experts (120) isn't divisible by 32.
So that might be something to consider when you prune GLM-4.6 (e.g. keeping 128 experts instead of 120; see the quick check below).
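A quick sanity check, assuming GLM-4.6 has 160 routed experts (which is consistent with the 0.25 cull above leaving 120):

```bash
# Which prune ratios leave an expert count divisible by 32?
# Assumes 160 routed experts in GLM-4.6 (160 * 0.75 = 120 matches the 0.25 run above).
EXPERTS=160
for RATIO in 0.10 0.15 0.20 0.25; do
  KEPT=$(python3 -c "print(round($EXPERTS * (1 - $RATIO)))")
  if [ $((KEPT % 32)) -eq 0 ]; then DIV=yes; else DIV=no; fi
  echo "ratio $RATIO -> $KEPT experts kept (divisible by 32: $DIV)"
done
# ratio 0.20 keeps 128 experts, which would satisfy the exllamav3 constraint.
```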
I am stumbling my way around preserving MTP tensors during REAP
Are these necessary for a quant, or more for future use when MTP gets implemented in one of the popular inference engines?
I am stumbling my way around preserving MTP tensors during REAP
Are these necessary for a quant, or more for future use when MTP gets implemented in one of the popular inference engines?
Yes, llama.cpp expects them to be present and won't load without them. We could maybe patch llama.cpp, but I think it's better to preserve the MTP tensors to avoid having to re-quant later when MTP becomes supported.
Yes, llama.cpp expects them to be present and won't load without them.
Oh, I managed to create a gguf without them and it seems coherent.
That's just a quick Q2_K with no imatrix ^
Just swap this from 1 -> 0 before running the conversion script:
https://huggingface.co/AesSedai/GLM-4.6-REAP-266B-A32B/blob/main/config.json#L32
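If I'm reading the linked line right, that's the MTP layer count in the config; I believe the key is `num_nextn_predict_layers`, but verify against the file. Something like this flips it before running llama.cpp's convert_hf_to_gguf.py:

```bash
# Assumes the value on config.json line 32 is "num_nextn_predict_layers": 1 -- check the file first.
# Setting it to 0 makes the GGUF conversion skip the MTP head.
sed -i 's/"num_nextn_predict_layers": 1/"num_nextn_predict_layers": 0/' config.json
```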
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
Yes, llama.cpp expects them to be present and won't load without them.
Oh, I managed to create a gguf without them and it seems coherent.
That's just a quick Q2_K with no imatrix ^
Just swap this from 1 -> 0 before running the conversion script:
https://huggingface.co/AesSedai/GLM-4.6-REAP-266B-A32B/blob/main/config.json#L32
Interesting, good to know that's a workaround at least!
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
MTP isn't going to help you on hybrid inference. The PR in llama.cpp is already proving that out. Same as how nobody got much juice from deepseek MTP. Since they never pruned the MTP layer (prolly how this got started) it's not likely to be functional even if you hack it back in.
Kudos on seeing the Q2K.. just holding out for larger Q3/Q4 quants. I'm using the big one at Q3K_XL, so IQ4_NL or one of them.. whatever is around 120-130 GB with imatrix. The EXL3 is probably going to be lit as well. No more 2.01 bpw, maybe I get into the 3s. This would be done already, but I dunno who I have to shank for better internet.. So... the question is...
How is it?
I'm using the big one at Q3K_XL, so IQ4_NL or one of them..
Check out this quant if you haven't already: Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF.
It seems like the best bang-for-buck if you can run it.
Kudos on seeing the Q2K..
just holding out for larger Q3/Q4 quants.
I'll put the Q4_K up for a little while then (until I get close to the HF public storage limit again), but these guys are probably going to make proper imatrix'd quants once they get MTP re-implemented.
How is it?
I only tested it briefly; it seemed "normal" to me. I didn't imatrix it. I doubt a pruned model like this, without retraining, will be better than the big one at Q3K_XL.
Well, the interesting question is where the “crossover” is.
If one can only run a Q2KL mix, would pruning 12% and jumping up above 3 bpw be better? The paper certainly suggests so. There’s a steep cliff between IQ2KL and IQ3KS.
Same with exl3, as dropping from 3 bpw to 2 is painful.
Even if the losses are really domain-specific (with the default being Alpaca code-style questions), that’s still interesting, as prunes could be “specialized” with no retraining.
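Some rough numbers for that crossover, assuming ~355B total params for full GLM-4.6 and taking the 266B figure from the REAP repo name (file size roughly tracks params * bpw / 8):

```bash
# Back-of-envelope GGUF sizes: GB ~= params * bpw / 8 / 1e9.
# The 355B and 266B parameter counts are assumptions (public figure / repo name), not measured values.
python3 -c "
combos = [('full GLM-4.6', 355e9, 2.7), ('full GLM-4.6', 355e9, 3.5),
          ('REAP-266B',    266e9, 3.5), ('REAP-266B',    266e9, 4.8)]
for name, params, bpw in combos:
    print(f'{name:13s} @ {bpw:.1f} bpw ~ {params * bpw / 8 / 1e9:.0f} GB')
"
# i.e. at a ~120 GB budget, the full model sits around 2.7 bpw while the 25% prune fits at ~3.5 bpw.
```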
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
MTP isn't going to help you on hybrid inference. The PR in llama.cpp is already proving that out. Same as how nobody got much juice from deepseek MTP. Since they never pruned the MTP layer (prolly how this got started) it's not likely to be functional even if you hack it back in.
I was arguing with someone about this the other day and suspected as much, since the CPU doesn’t have the “extra” compute to make MTP as cheap as it is on a GPU.
Kudos on seeing the Q2K.. just holding out for larger Q3/Q4 quants. I'm using the big one at Q3K_XL, so IQ4_NL or one of them.. whatever is around 120-130 GB with imatrix. The EXL3 is probably going to be lit as well. No more 2.01 bpw, maybe I get into the 3s. This would be done already, but I dunno who I have to shank for better internet..
I can't make the imatrix with 128GB RAM (can I?), but I can make quants once someone else does.
So... the question is...
How is it?
That is an excellent question.
Is KLD testing this vs full GLM valid? Or are benchmarks the only reliable way?
I don't know if KLD is valid, but it's certainly going to be enlightening: KLD against the full model and KLD against the pruned model. Guess I'd better start downloading; it's probably going to take overnight. Q3K_XL is 158 GB and this is 162 GB, so I keep the same speed but in theory gain fidelity.
The dataset was English, right? So that means we're shedding the CN experts. Something that was already done with Qwen in a more brutal way. IIRC experts only specialize in series of tokens like punctuation, etc.
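For the KLD comparison, something like this should work with llama.cpp's llama-perplexity (model and file names below are placeholders):

```bash
# Sketch of a KLD run with llama.cpp's llama-perplexity; GGUF and text file names are placeholders.
# 1) Save reference logits from the unpruned model on some eval text:
./llama-perplexity -m GLM-4.6-full-Q8_0.gguf -f eval.txt --kl-divergence-base glm46_base_logits.bin
# 2) Compute KL divergence of the pruned quant against those logits:
./llama-perplexity -m GLM-4.6-REAP-266B-Q4_K.gguf -f eval.txt \
  --kl-divergence-base glm46_base_logits.bin --kl-divergence
```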
The dataset was English, right? So that means we're shedding the CN experts.
No
Yeah, I’m not sure I buy that. Don’t LLMs abstract language away, so they pick stuff up even if it was trained in another language?
The testing will be enlightening. I am AFK, but will try some stuff with smaller MoEs.
It's not something to buy; it's the result of a similar experiment that kalomaze did: https://huggingface.co/kalomaze/Qwen3-16B-A3B/discussions/14
Using the model at Q4_K, it surprised me by getting some logic tests right that the full version (at IQ3_KS and via the Z.AI API) gets wrong...?
I can't make the imatrix with 128GB RAM (can I?), but I can make quants once someone else does.
I've created an imatrix from the Q8_0 using ubergarm's calibration data.
Note: the Q4_K quant in that repo is not imatrix-calibrated, though.
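For anyone wanting to reproduce the imatrix step, it looks roughly like this with llama.cpp (file names are placeholders; the calibration text passed via `-f` was ubergarm's data):

```bash
# Rough shape of the imatrix generation with llama.cpp (file names are placeholders):
./llama-imatrix -m GLM-4.6-REAP-266B-Q8_0.gguf -f ubergarm_calibration_data.txt -o imatrix.dat
# The resulting imatrix.dat then gets passed to llama-quantize via --imatrix when making the low-bit quants:
./llama-quantize --imatrix imatrix.dat GLM-4.6-REAP-266B-F16.gguf GLM-4.6-REAP-266B-IQ4_NL.gguf IQ4_NL
```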
similar experiment that kalomaze did
That model lost a lot more than just non-English though lol.
I've used the model a bit more now at Q4_K.
It seemed good for one-shot translations, throwaway app generation, etc., but it seems to degrade quickly in multi-turn conversations.
Still coherent, but it randomly misses information (the full version at IQ2_KS doesn't do this).
