codelion (Asankhaya Sharma)

commented on The Optimal Architecture for Small Language Models about 11 hours ago

I ran the numbers on layer-only params (excluding embeddings):

Config	Hidden	Layers	Layer Params	Score	Tier
4L	768	4	28.3M	31.98%	Low
12L	512	12	37.7M	38.15%	High
16L	448	16	38.5M	32.61%	Low
24L	384	24	42.5M	31.79%	Low
32L	384	32	56.6M	38.50%	High
48L	320	48	59.0M	32.45%	Low
64L	256	64	50.3M	38.21%	High

The 48L config has the most layer params (59M) but is in the Low tier, while 12L has fewer (37.7M) and is High tier.
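As a rough check, the layer-only figures in the table can be reproduced with a short script. This is a minimal sketch that assumes roughly 12 × d² weights per transformer layer (attention plus a 4x MLP, ignoring LayerNorms and biases); the function names are illustrative and not taken from the article's code.

```python
# Layer-only parameter counts for the configs in the table above.
# Assumption: each transformer layer has ~12 * d^2 weights
# (4*d^2 attention + 8*d^2 MLP), ignoring LayerNorms and biases.

# (name, hidden, layers, score_percent, tier), copied from the table.
CONFIGS = [
    ("4L", 768, 4, 31.98, "Low"),
    ("12L", 512, 12, 38.15, "High"),
    ("16L", 448, 16, 32.61, "Low"),
    ("24L", 384, 24, 31.79, "Low"),
    ("32L", 384, 32, 38.50, "High"),
    ("48L", 320, 48, 32.45, "Low"),
    ("64L", 256, 64, 38.21, "High"),
]


def layer_only_params(hidden: int, layers: int) -> int:
    """Approximate parameters in the transformer stack, excluding embeddings."""
    return layers * 12 * hidden ** 2


def ranked_by_params(configs):
    """Return (name, params, tier) tuples, largest layer stack first."""
    rows = [
        (name, layer_only_params(hidden, layers), tier)
        for name, hidden, layers, _score, tier in configs
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, params, tier in ranked_by_params(CONFIGS):
        print(f"{name:>4}  {params / 1e6:5.1f}M  {tier}")
```

Sorted by layer params, the tiers come out as Low, High, High, Low, Low, High, Low, so layer parameter count alone does not separate the two tiers.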

The hidden dimension threshold still dominates. But per-layer representation width seems critical: with hidden=320 or 256, you create an information bottleneck that more layers can't overcome, unless you hit the critical depth thresholds (32 or 64 layers) where something else compensates.

This suggests the finding should be reframed as: at small scale, you need sufficient hidden dimension AND appropriate depth.

(BTW, based on your earlier comment I've added a note to the article clarifying the parameter matching limitations — thanks for the feedback!)

commented on The Optimal Architecture for Small Language Models about 22 hours ago

Here's the full breakdown of where parameters come from:

Embeddings (scales linearly with d_model)

Token embeddings: vocab_size × d_model = 50,257 × d
Position embeddings: 1,024 × d
Total: ~51,281 × d

Per transformer layer (scales quadratically with d_model)

Attention (Q, K, V, O): 4 × d²
MLP (up + down, with 4x intermediate): 2 × d × 4d = 8d²
LayerNorms: ~4d (negligible)
Total per layer: ~12d²

LM Head

Usually tied with embeddings (free) or d × vocab_size

4L × 768:

Embeddings: 51,281 × 768 ≈ 39.4M
Layers: 4 × 12 × 768² ≈ 28.3M
Total: ~68M

12L × 512:

Embeddings: 51,281 × 512 ≈ 26.3M
Layers: 12 × 12 × 512² ≈ 37.7M
Total: ~64M
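The breakdown above can be written as a small calculator. This is a sketch under the same approximations (LayerNorms and biases ignored, LM head tied by default); the constant and function names are my own, not from any library.

```python
# Approximate parameter count from the breakdown above.
# Assumptions: GPT-2-style vocab (50,257) and 1,024 learned positions,
# 4x MLP expansion, LayerNorms and biases ignored as negligible.

VOCAB_SIZE = 50_257
MAX_POSITIONS = 1_024


def embedding_params(d_model: int) -> int:
    """Token plus position embeddings: scales linearly with d_model."""
    return (VOCAB_SIZE + MAX_POSITIONS) * d_model


def per_layer_params(d_model: int) -> int:
    """One transformer layer: scales quadratically with d_model."""
    attention = 4 * d_model ** 2       # Q, K, V, O projections
    mlp = 2 * d_model * (4 * d_model)  # up + down projections
    return attention + mlp             # ~12 * d_model^2


def total_params(d_model: int, n_layers: int, tied_lm_head: bool = True) -> int:
    """Embeddings plus layers, plus an LM head if it is not tied."""
    total = embedding_params(d_model) + n_layers * per_layer_params(d_model)
    if not tied_lm_head:
        total += d_model * VOCAB_SIZE
    return total


if __name__ == "__main__":
    for d_model, n_layers in [(768, 4), (512, 12)]:
        total = total_params(d_model, n_layers)
        print(f"{n_layers}L x {d_model}: {total / 1e6:.1f}M")
```

With a tied LM head this gives about 67.7M for 4L × 768 and about 64.0M for 12L × 512, matching the rounded totals above.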

commented on The Optimal Architecture for Small Language Models about 22 hours ago

Thanks for the references. I will take a look.

New activity in codelion/dhara-70m 1 day ago

High-throughput deployment use cases

1

#1 opened 1 day ago by Cagnicolas

reacted to their post with 👍 2 days ago

Post

5522

Introducing Dhara-70M: A diffusion language model that achieves 3.8x higher throughput than autoregressive models!

Key findings from our research on optimal architectures for small language models:

→ Depth beats width: 32 layers outperforms 12 layers at the same parameter count
→ Best-in-class factuality: 47.5% on TruthfulQA
→ 10x training efficiency using WSD (Warmup-Stable-Decay) conversion
→ Canon layers add only 0.13% parameters but improve reasoning

We trained on 1B tokens using the optimal 50-30-20 dataset mix (PDFs + filtered web + educational content), then converted to diffusion with just 100M additional tokens.

Blog: https://huggingface.co/blog/codelion/optimal-model-architecture
Model: codelion/dhara-70m

1 reply

·

reacted to their post with 🤗🚀🔥 3 days ago

posted an update 3 days ago

published an article 3 days ago

Article

The Optimal Architecture for Small Language Models

3 days ago

•

51

liked a model 3 days ago

codelion/dhara-70m

Text Generation • 71.3M • Updated 4 days ago • 3.17k • 19

published a model 3 days ago

codelion/dhara-70m

Text Generation • 71.3M • Updated 4 days ago • 3.17k • 19

updated a model 3 days ago

codelion/Qwen3-4B-Instruct-2507-self-verify-lora

Updated 3 days ago • 28

updated a model 4 days ago

codelion/dhara-70m

Text Generation • 71.3M • Updated 4 days ago • 3.17k • 19

upvoted an article 4 days ago

Article

The Optimal Architecture for Small Language Models

3 days ago

•

51

updated a collection 4 days ago

Dhara Foundational Models

Collection

Diffusion Language Models combining deep narrow networks, Canon layers (depthwise causal convolutions), and WSD (Warmup-Stable-Decay) training. • 1 item • Updated 2 days ago • 2

published a model 6 days ago

codelion/Qwen3-4B-Instruct-2507-self-verify-lora

Updated 3 days ago • 28

updated 2 Spaces 9 days ago

PTS Visualizer

🔍

4

Visualize pivotal tokens and thought anchors in language models

reacted to their post with 👍 9 days ago

Post

2337

Introducing PTS Visualizer - an interactive tool for exploring how language models reason!

Visualize pivotal tokens, thought anchors, and reasoning circuits. See which tokens and sentences significantly impact success probability, explore embedding clusters, and trace reasoning step-by-step.

Try it: codelion/pts-visualizer

Explore PTS datasets:
- Qwen3-0.6B: codelion/Qwen3-0.6B-pts
- DeepSeek-R1: codelion/DeepSeek-R1-Distill-Qwen-1.5B-pts

Or upload your own JSONL files!

GitHub: https://github.com/codelion/pts
