Besides pruning...

#4
by Lockout

Can we put these experts in CPU RAM and leave the actually used ones on the GPU? This would be helpful for the 235B since it's huge, and as a bonus, no retraining is needed.

llama.cpp can do this using the `-ot` (override tensor) parameter. E.g., to offload all MoE expert tensors to CPU, you pass `-ot ".ffn_.*_exps.=CPU"`.
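For reference, a full invocation might look like the sketch below. The model filename is a placeholder, and `-ngl 99` just means "offload all layers to GPU", so only the expert tensors matched by the override end up in CPU RAM:

```sh
# Offload all layers to the GPU (-ngl 99), then override the fused MoE
# expert tensors so they live in CPU RAM (model path is a placeholder).
./llama-server \
  -m ./Qwen3-235B-A22B-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  -ot ".ffn_.*_exps.=CPU"
```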

Could we get a list of those tensors?
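You can dump them yourself. A sketch using the `gguf-dump` script from the `gguf` Python package (the filename is a placeholder, and I'm assuming the fused expert tensors follow the usual `ffn_{gate,up,down}_exps` naming):

```sh
# Install llama.cpp's GGUF tooling, then list only the per-layer
# fused expert tensors (gate/up/down) from the model file.
pip install gguf
gguf-dump ./Qwen3-235B-A22B-Q4_K_M.gguf | grep -E "ffn_.*_exps"
```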

Is there any way to find out which MoE tensor relates to which expert?

Alright, the information is part of the repository; check routing_stats_layer_X.txt.

Does llama.cpp support offloading to storage?


Did a bit of research: llama.cpp offloads at whole-tensor granularity, and each layer's experts are fused into a single tensor, so it can't offload individual MoE experts.
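You can still place particular layers' fused expert tensors with `-ot`, though, just not single experts inside one. A sketch, assuming a CUDA backend where the first GPU is named `CUDA0` and that the first matching `-ot` rule wins (which is how it behaved when I tested):

```sh
# Keep the expert tensors of layers 0-9 on the first GPU and spill every
# other layer's experts to CPU RAM; the more specific rule comes first.
./llama-server \
  -m ./Qwen3-235B-A22B-Q4_K_M.gguf \
  -ngl 99 \
  -ot "blk\.[0-9]\.ffn_.*_exps.=CUDA0" \
  -ot ".ffn_.*_exps.=CPU"
```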

> Does llama.cpp support offloading to storage?

It does this automatically unless you disable memory mapping (`--no-mmap`).
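Concretely, with the default mmap path the weights stay on storage and pages are read in on demand; `--no-mmap` forces a full load instead (filename again a placeholder):

```sh
# Default: weights are memory-mapped, so untouched experts can stay
# on storage and are paged in only when the router picks them.
./llama-server -m ./Qwen3-235B-A22B-Q4_K_M.gguf -ngl 99

# Disable memory mapping: the whole model is read into RAM up front.
./llama-server -m ./Qwen3-235B-A22B-Q4_K_M.gguf -ngl 99 --no-mmap
```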
