Besides pruning...
Can we put these experts in CPU RAM and leave the actually used ones on the GPU? This would be helpful for the 235B since it's huge, and as a bonus there's no retraining.
llama.cpp can do this using the -ot (--override-tensor) parameter, e.g. to offload all MoE expert tensors to CPU RAM you pass -ot ".ffn_.*_exps.=CPU"
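Rough sketch of what that might look like on the command line (the GGUF filename and the -ngl value are just placeholders, and this assumes a llama.cpp build recent enough to have -ot):

# offload all layers to the GPU (-ngl 99), then override the MoE expert
# tensors so they stay in CPU RAM instead
./llama-cli -m ./Qwen3-235B-A22B-Q4_K_M.gguf \
    -ngl 99 \
    -ot ".ffn_.*_exps.=CPU" \
    -p "Hello"

The idea is that the attention and shared weights stay on the GPU while the big expert tensors sit in system RAM, trading some speed for being able to fit the model at all.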
Could we get a list of those tensors?
Is there any way to find out which MoE tensor relates to which expert?
Kalomaze seems to have this information. https://xcancel.com/kalomaze/status/1918238263330148487?cursor=QAAAAPAxHBlmwsGwpf_FlJ81tIfTxcb5up81kIOylZj7tp81jsbTzaC_-541qsW-yYPv-Z41sofWsYCDmp81JQISFQQAAA#r
Alright, the information is part of the repository; check routing_stats_layer_X.txt
Does llama.cpp support offloading to storage?
Did a bit of research: llama.cpp can only offload whole layers, not individual MoE experts.
It does that (reading weights from storage) automatically: the model is memory-mapped from disk by default, unless you disable mmap.
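For example (same placeholder model file as above; --no-mmap is the switch that disables it):

# default: weights are memory-mapped from storage and paged in on demand
./llama-cli -m ./Qwen3-235B-A22B-Q4_K_M.gguf -ot ".ffn_.*_exps.=CPU" -p "Hello"

# force the model to be fully loaded into RAM instead of mapped from disk
./llama-cli -m ./Qwen3-235B-A22B-Q4_K_M.gguf -ot ".ffn_.*_exps.=CPU" --no-mmap -p "Hello"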