Besides pruning...
Can we put these experts in CPU RAM and leave the actually used ones on the GPU? This would be helpful for the 235B since it's huge, and as a bonus there's no retraining.
llama.cpp can do this using the -ot (--override-tensor) parameter, e.g. to offload all MoE expert tensors to CPU RAM you pass -ot ".ffn_.*_exps.=CPU"
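Rough sketch of what that might look like on the command line (the GGUF filename and the -ngl value are just placeholders, and this assumes a llama.cpp build recent enough to have -ot):

# offload all layers to the GPU (-ngl 99), then override the MoE expert
# tensors so they stay in CPU RAM instead
./llama-cli -m ./Qwen3-235B-A22B-Q4_K_M.gguf \
    -ngl 99 \
    -ot ".ffn_.*_exps.=CPU" \
    -p "Hello"

The idea is that the attention and shared weights stay on the GPU while the big expert tensors sit in system RAM, trading some speed for being able to fit the model at all.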
Could we get a list of those tensors?
Is there any way to find out which MoE tensor relates to which expert?
Kalomaze seems to have this information. https://xcancel.com/kalomaze/status/1918238263330148487?cursor=QAAAAPAxHBlmwsGwpf_FlJ81tIfTxcb5up81kIOylZj7tp81jsbTzaC_-541qsW-yYPv-Z41sofWsYCDmp81JQISFQQAAA#r
Alright, the information is part of the repository; check routing_stats_layer_X.txt
Does llama.cpp support offloading to storage?
Did a bit of research: llama.cpp can only offload whole layers, not individual MoE experts.
It does that (reading weights from storage) automatically: the model is memory-mapped from disk by default, unless you disable mmap.
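For example (same placeholder model file as above; --no-mmap is the switch that disables it):

# default: weights are memory-mapped from storage and paged in on demand
./llama-cli -m ./Qwen3-235B-A22B-Q4_K_M.gguf -ot ".ffn_.*_exps.=CPU" -p "Hello"

# force the model to be fully loaded into RAM instead of mapped from disk
./llama-cli -m ./Qwen3-235B-A22B-Q4_K_M.gguf -ot ".ffn_.*_exps.=CPU" --no-mmap -p "Hello"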