Performance of the Fused Model

#1
by pprp - opened

Hi, glad to see such a practical and wonderful repo. I got to know this project on Reddit. Pruning or fusing some experts to make the model smaller is clearly a direct and promising way to compress DeepSeek-V3.

I noticed that you have provided several techniques in the moe-pruner repo. Could you share performance numbers, e.g., perplexity and downstream-task results (such as MMLU and GSM8K)?

Owner

Hello there!

Thanks for the appreciation,

I haven't been able to run any benchmarks yet, as the fused architecture is not compatible with inference engines for now and pure PyTorch inference is far too slow.
So the only thing I can say for sure is that even the unhealed models are capable of generating coherent English (not an achievement in itself, but not bad for a model that lost 96% of its parameters).
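For whenever inference does become practical, a perplexity check with plain transformers would look roughly like the sketch below. This is only an outline under assumptions: the fused checkpoint would need to load through AutoModelForCausalLM with trust_remote_code (which is exactly what isn't guaranteed yet), and the model path is a placeholder.

```python
# Minimal sketch: non-overlapping-window perplexity with plain transformers.
# Assumes the fused checkpoint loads via AutoModelForCausalLM (not guaranteed).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/fused-model"  # placeholder path
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

# WikiText-2 test set as the eval corpus (any held-out text works)
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

max_len = 2048
nlls, n_tokens = [], 0
for begin in range(0, ids.size(1), max_len):
    chunk = ids[:, begin : begin + max_len].to(model.device)
    if chunk.size(1) < 2:
        continue
    with torch.no_grad():
        # labels == input_ids: the model shifts internally and returns mean NLL
        out = model(chunk, labels=chunk)
    nlls.append(out.loss * (chunk.size(1) - 1))
    n_tokens += chunk.size(1) - 1

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"perplexity: {ppl.item():.2f}")
```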

I am currently running post-training on the 29B versions; I'll look into how to quantize it and run inference with vLLM and/or GGUF after that (due to the size, it will require a good tensor parallelism implementation to run effectively).
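Rough idea of what that would look like on the vLLM side once (if) it loads the fused architecture; the checkpoint path and tensor-parallel degree below are placeholders, not something that works today:

```python
# Sketch: serving the fused model with vLLM and tensor parallelism.
# Assumes vLLM can load the fused architecture, which is the open question.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/fused-29B",  # placeholder checkpoint
    tensor_parallel_size=4,     # split the weights across 4 GPUs
    trust_remote_code=True,
    dtype="bfloat16",
)
params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Explain mixture-of-experts pruning in one paragraph."], params)
print(out[0].outputs[0].text)
```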
