I ran the numbers on layer-only params (excluding embeddings):
| Config | Hidden | Layers | Layer Params | Score | Tier |
|---|---|---|---|---|---|
| 4L | 768 | 4 | 28.3M | 31.98% | Low |
| 12L | 512 | 12 | 37.7M | 38.15% | High |
| 16L | 448 | 16 | 38.5M | 32.61% | Low |
| 24L | 384 | 24 | 42.5M | 31.79% | Low |
| 32L | 384 | 32 | 56.6M | 38.50% | High |
| 48L | 320 | 48 | 59.0M | 32.45% | Low |
| 64L | 256 | 64 | 50.3M | 38.21% | High |
The 48L config has the most layer params (59M) but is in the Low tier, while 12L has fewer (37.7M) and is High tier.
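For anyone who wants to check the Layer Params column: the values match the common approximation of 12 × hidden² × layers (4h² for attention plus 8h² for an MLP with 4x expansion, ignoring biases and layer norms) to within rounding. Here is a quick sketch under that assumption; the `layer_params` name is just illustrative.

```python
# Approximate layer-only parameter count for a standard transformer block.
# ASSUMPTION: 4*h^2 (attention Q, K, V, output) + 8*h^2 (MLP, 4x expansion),
# with biases, layer norms, and embeddings ignored.
def layer_params(hidden: int, layers: int) -> int:
    return 12 * hidden * hidden * layers

# (config, hidden, layers, tier) copied from the table above.
configs = [
    ("4L", 768, 4, "Low"),
    ("12L", 512, 12, "High"),
    ("16L", 448, 16, "Low"),
    ("24L", 384, 24, "Low"),
    ("32L", 384, 32, "High"),
    ("48L", 320, 48, "Low"),
    ("64L", 256, 64, "High"),
]

# Sort by parameter count to see whether tier tracks size.
rows = [(name, layer_params(h, n), tier) for name, h, n, tier in configs]
for name, params, tier in sorted(rows, key=lambda r: r[1]):
    print(f"{name:>4}  {params / 1e6:5.1f}M  {tier}")
```

Sorted this way, the High and Low tiers interleave, which is the point: layer parameter count alone doesn't predict the tier.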
The hidden dimension threshold still dominates. But per-layer representation width seems critical: with hidden=320 or 256, you create an information bottleneck that more layers can't overcome unless you hit the critical depth thresholds (32 or 64 layers), where something else compensates.
This suggests the finding should be reframed as: at small scale, you need sufficient hidden dimension AND appropriate depth.
(BTW, based on your earlier comment I've added a note to the article clarifying the parameter matching limitations — thanks for the feedback!)