---
library_name: transformers
tags: []
---
# Deepseek-v3-Base Group 8 Average Weights
The intermediate dimension of DeepSeek-v3's dense MLP layers (the first 3 layers) happens to be 18432, which equals 9 x 2048. Each MoE layer has 256 routed experts with an intermediate dimension of 2048, organized into 8 groups of 32. We can average the 32 experts within each group into a single expert, then concatenate the 8 group averages into a 16384-dimensional MLP (8 x 2048). Adding the shared expert (another 2048) brings the new MLP to 18432, matching the dense layers.
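Below is a minimal sketch of this merging step for one MoE layer. It assumes the usual DeepSeek-V3 parameter naming (`mlp.experts.{j}.gate_proj/up_proj/down_proj` and `mlp.shared_experts.*`) and already-dequantized tensors; the exact key names and layout of the real checkpoint may differ.

```python
# Hedged sketch: collapse one MoE layer into a dense-style MLP by
# group-averaging the routed experts and appending the shared expert.
import torch

N_EXPERTS = 256              # routed experts per MoE layer
N_GROUPS = 8                 # 8 groups of 32 experts
GROUP_SIZE = N_EXPERTS // N_GROUPS


def merge_moe_layer(state_dict: dict, layer_idx: int) -> dict:
    """Average the 32 experts of each group, concatenate the 8 group
    averages (8 x 2048 = 16384), then append the shared expert
    (+2048 = 18432) so the result matches the dense MLP shape."""
    prefix = f"model.layers.{layer_idx}.mlp"
    merged = {}
    # gate/up project hidden -> intermediate (concat rows), down projects back (concat cols)
    for proj, cat_dim in (("gate_proj", 0), ("up_proj", 0), ("down_proj", 1)):
        group_avgs = []
        for g in range(N_GROUPS):
            experts = torch.stack([
                state_dict[f"{prefix}.experts.{g * GROUP_SIZE + j}.{proj}.weight"]
                for j in range(GROUP_SIZE)
            ])
            group_avgs.append(experts.mean(dim=0))   # 32 experts -> 1 averaged expert
        shared = state_dict[f"{prefix}.shared_experts.{proj}.weight"]
        merged[f"{prefix}.{proj}.weight"] = torch.cat(group_avgs + [shared], dim=cat_dim)
    return merged
```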
## Model Details
### Model Description
Unfortunately, this model doesn't work out of the box (after dequantizing and merging): all it generates is gibberish tokens. So either my code has a bug, or merging the experts down like this breaks the model beyond repair.
I'm trying to recover the MLP layers by continued pretraining, but I'm a bit low on compute and don't have much to spare. If you have a small corpus I could use, feel free to comment and suggest what I should do next.
A QLoRA pretraining test run can be found here: [theblackcat102/whale-v3-base-concept-test-lora-380](https://huggingface.co/theblackcat102/whale-v3-base-concept-test-lora-380)
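For reference, a hedged sketch of how that adapter could be applied with PEFT; the base-model path below is a placeholder, not a confirmed repo id.

```python
# Hypothetical loading sketch with PEFT; the base path is a placeholder
# for wherever the merged weights live.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "path/to/this-merged-base",   # placeholder: the merged checkpoint from this card
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(
    base, "theblackcat102/whale-v3-base-concept-test-lora-380"
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3-Base")
```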