Performance of the Fused Model
Hi, glad to see such a practical and wonderful repo. I got to know about this project on Reddit. Obviously, pruning or fusing some experts to make the model smaller is a direct and promising way to compress DeepSeek-V3.
I noticed that you have provided several techniques in the moe-pruner repo. Can you share performance numbers, such as perplexity (PPL) and downstream task results (e.g., MMLU, GSM8K)?
Hello there!
Thanks for the appreciation!
I haven't been able to run any benchmarks yet, as the fused architecture is not compatible with inference engines for now, and pure PyTorch inference is far too slow.
So the only thing I can say for sure is that even the unhealed models are capable of generating coherent English (not an achievement in itself, but not bad for a model that lost 96% of its parameters).
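For reference, that kind of sanity check can be done with plain Hugging Face transformers, which is the only path available while the fused architecture is unsupported by inference engines. This is just a minimal sketch: the model path is a placeholder, and `trust_remote_code=True` is assumed to be needed for the custom fused architecture.

```python
# Minimal sketch: coherence / perplexity sanity check with pure PyTorch.
# "path/to/fused-model" is a placeholder; trust_remote_code is assumed to be
# required for the custom fused architecture. Expect this to be very slow.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/fused-model"  # placeholder
tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Quick generation check: does the pruned model still produce coherent English?
inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))

# Rough perplexity on a short text (exp of the average cross-entropy loss)
text = "Deep learning models can be compressed by pruning redundant parameters."
enc = tok(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```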
I am currently running post-training on the 29B versions; I'll look into how to quantize and run inference with vLLM and/or GGUF after that (due to the size, it will require a good tensor parallelism implementation to run effectively).
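Once engine support lands, something along these lines is what I'd aim for with vLLM; the model path, dtype, and `tensor_parallel_size` here are assumptions, since the fused architecture isn't recognized by vLLM yet.

```python
# Sketch of the intended vLLM path once the fused architecture is supported.
# Model path and tensor_parallel_size are assumptions, not a working config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/fused-29B",   # placeholder path
    tensor_parallel_size=4,      # split the model across 4 GPUs
    dtype="bfloat16",
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain mixture-of-experts pruning in one sentence."], params)
print(outputs[0].outputs[0].text)
```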