What do we know about the architecture so far?

#6 · opened by amgadhasan

Hi,

Has anyone got any info about the architecture?

I suppose it's an MoE? What are the total and active parameter counts?

Does it support audio or vision input?

Also, this is the chat/instruct version, right?

~250B total params, since it's bf16 and ~500 GB of total space.
It seems to have shared/common layers like DeepSeek and Llama 4.
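
If anyone wants to sanity-check that figure, it's just the combined shard size divided by 2 bytes per bf16 parameter; the 500 GB figure below is an approximation, not an exact number:

```python
# Rough back-of-the-envelope estimate: assumes the ~500 GB of shards
# store each parameter exactly once in bf16 (2 bytes per parameter).
checkpoint_bytes = 500e9   # approximate combined size of the weight shards (assumption)
bytes_per_param = 2        # bf16
total_params = checkpoint_bytes / bytes_per_param
print(f"~{total_params / 1e9:.0f}B total parameters")  # -> ~250B
```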

MoE, no vision support. From my rough calculations, it's something like ~260B-A30B?

Reading the config attached to the model repo: same architecture as Grok-1, so a 314B MoE with 8 experts and 2 active for inference. No additional capabilities (same as Grok-1).

About a 270B-param MoE, ~115B active (2 experts out of 8). The shared FFN layers are very large. The tensors appear to be pre-sharded for 8-way tensor parallelism.
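
A rough sketch of where a ~115B active figure could come from: shared parameters (attention, embeddings, shared FFN) count fully, while the routed expert parameters count at 2/8. The shared/expert split below is just a placeholder to make the arithmetic concrete, not something read from the config:

```python
# Illustrative sketch only: the shared/expert split is an assumption,
# not taken from the actual config.
total_params = 270e9                          # rough total implied by the shard sizes
shared_params = 63e9                          # attention, embeddings, shared FFN (assumed)
expert_params = total_params - shared_params  # spread across the 8 routed experts

num_experts, active_experts = 8, 2
active_params = shared_params + expert_params * active_experts / num_experts
print(f"~{active_params / 1e9:.0f}B active per token")  # -> ~115B with these placeholders
```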
