Model Card for tim-lawson/skip-middle-fineweb-baseline-4-layers
We trained all models on a randomly sampled subset of the FineWeb dataset (Penedo et al., 2024) containing approximately 10B tokens, pre-tokenized with the GPT-2 tokenizer via the TikToken library (Radford et al., 2019; Ouyang et al., 2022). The validation set contained approximately 100M tokens. We used a global batch size of 512 sequences (524,288 tokens), achieved through data parallelism and gradient accumulation with a per-device batch size of 32 sequences on 4 NVIDIA A100 or GH200 GPUs.
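The batch-size figures above imply a sequence length and a number of gradient-accumulation steps that are not stated explicitly. The sketch below derives them by plain arithmetic, assuming all 4 devices take part in every micro-step; the variable names are illustrative and do not come from the training codebase.

```python
# Figures quoted in the text above.
GLOBAL_BATCH_SEQUENCES = 512
GLOBAL_BATCH_TOKENS = 524_288
PER_DEVICE_BATCH_SEQUENCES = 32
NUM_DEVICES = 4  # assumption: every device processes one micro-batch per micro-step

# Tokens in each training sequence.
tokens_per_sequence = GLOBAL_BATCH_TOKENS // GLOBAL_BATCH_SEQUENCES

# Sequences processed across all devices in one forward/backward pass.
sequences_per_micro_step = PER_DEVICE_BATCH_SEQUENCES * NUM_DEVICES

# Micro-steps accumulated before each optimizer update.
grad_accum_steps = GLOBAL_BATCH_SEQUENCES // sequences_per_micro_step

print(tokens_per_sequence, sequences_per_micro_step, grad_accum_steps)
```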
We based the underlying Transformer models on the reference implementation of Llama 3 (Grattafiori et al., 2024). In particular, we used: Grouped Query Attention (GQA; Ainslie et al. 2023); Rotary Positional Embeddings (RoPE; Su et al. 2024); Gated Linear Unit FFNs with Swish activation (SwiGLU; Shazeer 2020); and Root Mean Square layer normalization (RMSNorm; Zhang and Sennrich 2019). The key difference relative to Llama 3 is that we used the Sandwich-LN scheme (Ding et al., 2021; Kim et al., 2025) instead of Pre-LN. We initialized RMSNorm parameters to one and sampled all other parameters from a normal distribution with mean zero and standard deviation 0.02.
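To illustrate the difference between Pre-LN and Sandwich-LN, here is a minimal sketch for a single residual sub-layer (attention or FFN). It uses plain Python lists rather than PyTorch tensors to stay self-contained; the function names, the `eps` value, and the toy sub-layer are assumptions for illustration and are not taken from the authors' code.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square of x, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]

def pre_ln_sublayer(x, sublayer, w_in):
    """Pre-LN: normalize only the sub-layer input, x + f(norm(x))."""
    branch = sublayer(rms_norm(x, w_in))
    return [a + b for a, b in zip(x, branch)]

def sandwich_ln_sublayer(x, sublayer, w_in, w_out):
    """Sandwich-LN: normalize input and output, x + norm(f(norm(x)))."""
    branch = rms_norm(sublayer(rms_norm(x, w_in)), w_out)
    return [a + b for a, b in zip(x, branch)]

# Toy usage: a hypothetical sub-layer that scales its input by 10.
x = [3.0, 4.0]
gain = [1.0, 1.0]  # RMSNorm gains initialized to one, as in the text
scale_by_ten = lambda h: [10.0 * v for v in h]
print(pre_ln_sublayer(x, scale_by_ten, gain))
print(sandwich_ln_sublayer(x, scale_by_ten, gain, gain))
```

With unit gains, the Sandwich-LN branch is rescaled to roughly unit RMS before it is added to the residual stream, whereas the Pre-LN branch keeps whatever scale the sub-layer produces.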
The training codebase is based on the ‘nanoGPT speedrun’ repository (Karpathy, 2025; Jordan, 2025). We used the AdamW optimizer (Kingma and Ba, 2017; Loshchilov and Hutter, 2019) with a single learning rate for all model parameters, and a two-stage learning-rate scheduler: linear warm-up over the first 10% of the training steps, starting at 10% of the maximum learning rate, followed by cosine decay over the remaining steps. Lastly, we performed forward passes in bfloat16 with automatic mixed precision in PyTorch, except that we manually converted attention logits to float32.
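The two-stage schedule described above can be sketched as a function of the step index. This is a minimal illustration rather than the authors' scheduler: the function name and parameters are hypothetical, and the final learning rate reached by the cosine decay is not stated in the text, so `min_lr=0.0` is an assumption.

```python
import math

def learning_rate(step, total_steps, max_lr,
                  warmup_frac=0.1, start_frac=0.1, min_lr=0.0):
    """Linear warm-up from start_frac * max_lr, then cosine decay to min_lr.

    min_lr is an assumption; the text does not give the final learning rate.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        start_lr = start_frac * max_lr
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Toy usage with hypothetical values for total_steps and max_lr.
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, total_steps=1000, max_lr=1e-3))
```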
Model Sources
- Repository: https://github.com/tim-lawson/skip-middle
- Paper: https://arxiv.org/abs/2506.21103
