What do we know about the architecture so far?

#6 · opened by amgadhasan

Hi,

Has anyone got any info about the architecture?

I suppose it's an MoE? What are the total and active parameter counts?

Does it support audio or vision input?

Also, this is the chat/instruct version, right?

~250B total params, since it's bf16 and ~500 GB of total space.
It seems to have shared/common layers like DeepSeek and Llama 4.
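
If anyone wants to sanity-check that figure, it's just the combined shard size divided by 2 bytes per bf16 parameter; the 500 GB figure below is an approximation, not an exact number:

```python
# Rough back-of-the-envelope estimate: assumes the ~500 GB of shards
# store each parameter exactly once in bf16 (2 bytes per parameter).
checkpoint_bytes = 500e9   # approximate combined size of the weight shards (assumption)
bytes_per_param = 2        # bf16
total_params = checkpoint_bytes / bytes_per_param
print(f"~{total_params / 1e9:.0f}B total parameters")  # -> ~250B
```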

MoE, no vision support. From my rough calculations, it's something like ~260B-A30B?

Reading the config attached to the model repo: same architecture as Grok-1, so a 314B MoE with 8 experts and 2 active for inference. No additional capabilities (same as Grok-1).

About a 270B-param MoE, ~115B active (2 experts out of 8). The shared FFN layers are very large. The tensors appear to be pre-sharded for 8-way tensor parallelism.
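
A rough sketch of where a ~115B active figure could come from: shared parameters (attention, embeddings, shared FFN) count fully, while the routed expert parameters count at 2/8. The shared/expert split below is just a placeholder to make the arithmetic concrete, not something read from the config:

```python
# Illustrative sketch only: the shared/expert split is an assumption,
# not taken from the actual config.
total_params = 270e9                          # rough total implied by the shard sizes
shared_params = 63e9                          # attention, embeddings, shared FFN (assumed)
expert_params = total_params - shared_params  # spread across the 8 routed experts

num_experts, active_experts = 8, 2
active_params = shared_params + expert_params * active_experts / num_experts
print(f"~{active_params / 1e9:.0f}B active per token")  # -> ~115B with these placeholders
```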
