Model Card for GPT-BERT Causal Focus Small
A 31M-parameter model trained on 100M words (10M unique words) that can perform both causal and masked inference.
Table of Contents
- Model Card for GPT-BERT Causal Focus Small
- Table of Contents
- Model Details
- Uses
- Training Details
- Evaluation
- Technical Specifications
- Citation
- Model Card Authors
- Bibliography
Model Details
Model Description
This is one of the three GPT-BERT baselines for the strict-small track of the 2025 BabyLM challenge. This specific model is trained with a majority of examples using the causal objective and a minority using the masked objective.
- Developed by: Lucas Georges Gabriel Charpentier
- Model type: Language model (Causal and Masked)
- Language(s) (NLP): eng
- License: apache-2.0
- Resources for more information:
Uses
This is a pre-trained language model. It can be used to evaluate tasks zero-shot in both a causal and a masked setting. It can also be fine-tuned by replacing the language modeling head with a task-specific head. It can be used for language generation, but given its small size and the small number of words it was trained on, do not expect LLM-level performance. It can also be used for mask infilling. A minimal usage sketch is shown below.
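Below is a minimal sketch of zero-shot causal generation with the Hugging Face `transformers` library. The repository id is a placeholder, and loading through the standard Auto classes with `trust_remote_code=True` is an assumption about how the checkpoint is packaged, not something confirmed by this card.

```python
# Hedged sketch: causal generation via transformers.
# "babylm/gpt-bert-causal-focus-small" is a hypothetical repo id; the loading
# path (Auto classes + trust_remote_code) is an assumption, adapt as needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "babylm/gpt-bert-causal-focus-small"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "The little girl picked up the"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```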
Training Details
Training Data
We used the BabyLM 10M (Strict-small) dataset to train the model. It is composed in the following way:
Source | Weight | Domain | Citation | Website | License |
---|---|---|---|---|---|
BNC | 8% | Dialogue | BNC Consortium (2007) | link | link 1 |
CHILDES | 29% | Dialogue, Child-Directed | MacWhinney (2000) | link | |
Project Gutenberg | 26% | Fiction, Nonfiction | Gerlach & Font-Clos (2020) | link | link |
OpenSubtitles | 20% | Dialogue, Scripted | Lison & Tiedemann (2016) | link | Open source |
Simple English Wikipedia | 15% | Nonfiction | -- | link | link |
Switchboard | 1% | Dialogue | Godfrey et al. (1992), Stolcke et al., (2000) | link | link |
1 Our distribution of part of the BNC Texts is permitted under the fair dealing provision of copyright law (see term (2g) in the BNC license).
Hyperparameters
Hyperparameter | Value |
---|---|
% Causal Objective | 93.75% |
% Masked Objective | 6.25% |
Sequence Length | 128 → 512 |
Batch Size (in tokens) | 16 384 |
Learning Rate | 0.007 |
Number of Steps | 9 914 |
Warmup Ratio | 1.6% |
Cooldown Ratio | 1.6% |
Mask Ratio | 0.3 → 0.15 |
Random Ratio | 0.1 |
Keep Ratio | 0.1 |
Weight Decay | 0.1 |
Optimizer | LAMB |
Optimizer Epsilon | 10⁻⁸ |
Optimizer Beta_1 | 0.9 |
Optimizer Beta_2 | 0.98 |
Gradient Clipping | 2.0 |
Z-Loss weight | 0.0001 |
Training Procedure
During training we vary both the mask token percentage (linear decay from 30% to 15%) and the sequence length. When we increase the sequence length, we keep the total number of tokens per batch constant by reducing the number of sequences per batch correspondingly. The sequence length increases in three steps:
- We start with a sequence length of 128 for 60% of the training.
- The next 20% has a sequence length of 256.
- The final 20% has a sequence length of 512.

We use a warmup-cosine-cooldown learning-rate scheduler with the warmup and cooldown percentages reported in the Hyperparameters section. A sketch of these schedules is given below.
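The following sketch illustrates how the three schedules fit together, using the numbers from the Hyperparameters table. It is an illustration, not the actual training code; in particular, the shape of the cooldown phase and the minimum learning rate are assumptions.

```python
# Hedged sketch of the schedules described above. Illustration only, not the
# training code; the cooldown shape and MIN_LR are assumptions.
import math

TOKENS_PER_BATCH = 16_384          # constant token budget per batch
TOTAL_STEPS = 9_914
PEAK_LR = 0.007
WARMUP, COOLDOWN = 0.016, 0.016    # 1.6% each
MIN_LR = 0.1 * PEAK_LR             # assumed floor at the end of the cosine phase


def sequence_length(progress: float) -> int:
    """128 for the first 60% of training, 256 for the next 20%, 512 for the final 20%."""
    if progress < 0.6:
        return 128
    if progress < 0.8:
        return 256
    return 512


def sequences_per_batch(progress: float) -> int:
    """Shrink the batch (in sequences) so that tokens per batch stay constant."""
    return TOKENS_PER_BATCH // sequence_length(progress)


def mask_ratio(progress: float) -> float:
    """Linear decay of the mask ratio from 30% to 15% over training."""
    return 0.30 - (0.30 - 0.15) * progress


def learning_rate(progress: float) -> float:
    """Warmup-cosine-cooldown schedule (one plausible reading of the card)."""
    if progress < WARMUP:
        return PEAK_LR * progress / WARMUP
    if progress > 1.0 - COOLDOWN:
        return MIN_LR * (1.0 - progress) / COOLDOWN
    frac = (progress - WARMUP) / (1.0 - WARMUP - COOLDOWN)
    return MIN_LR + (PEAK_LR - MIN_LR) * 0.5 * (1.0 + math.cos(math.pi * frac))


for step in (0, 200, 5_000, 7_500, 9_913):
    p = step / TOTAL_STEPS
    print(step, sequence_length(p), sequences_per_batch(p),
          round(mask_ratio(p), 3), f"{learning_rate(p):.5f}")
```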
Size and checkpoints
The model has 31M parameters. In total we train on around 100M words (ten repetitions of the training set). We provide multiple checkpoints from the training. Specifically, we provide:
- Checkpoints every 1M words of pretraining for the first 10M words (or every 99.14 steps)
- Checkpoints every 10M words of pretraining up to the full 100M words (or every 991.4 steps)
Evaluation
This model is evaluated in three different fashions:
- We provide a validation loss calculated on 1M words from the development set of the BabyLM data (same source as those found in Training Data).
- We do zero-shot evaluation on 7 tasks.
- We do fine-tuning on a subset of the (Super)GLUE tasks (Wang et al., ICLR 2019; Wang et al., NeurIPS 2019).
Testing Data & Metrics
Testing Data
For the BLiMP, BLiMP Supplement, and EWoK tasks, we use a filtered version of each dataset that only includes examples whose words appear in the BabyLM dataset. For the fine-tuning tasks, we both filter and subsample down to a maximum of 10 000 training examples.
Validation Data
1M words from the development split of BabyLM. The evaluation is done using the Masked Next Token Prediction objective.
Zero-shot Tasks
- BLiMP: The Benchmark of Linguistic Minimal Pairs evaluates the model's linguistic ability by seeing if it can recognize the grammatically correct sentence from a pair of minimally different sentences. It tests various grammatical phenomena. (Warstadt et al., TACL 2020)
- BLiMP Supplement: A supplement to BLiMP introduced in the first edition of the BabyLM challenge. More focused on dialogue and questions. (Warstadt et al., CoNLL-BabyLM 2023)
- EWoK: Works similarly to BLiMP but probes the model's internal world knowledge, covering both physical and social knowledge. (Ivanova et al., 2024)
- Eye Tracking and Self-paced Reading: Looks at whether the model can mimic human eye-tracking and reading-time data, using the surprisal of a word as a proxy for the time spent reading it. (de Varda et al., BRM 2024)
- Entity Tracking: Checks whether a model can keep track of the changes to the states of entities as text/dialogue unfolds. (Kim & Schuster, ACL 2023)
- WUGs: Tests morphological generalization in LMs through an adjective nominalization task. (Hofmann et al., 2024)
Finetuning Tasks
- BoolQ: A yes/no QA dataset with unprompted and unconstrained questions. (Clark et al., NAACL 2019)
- MNLI: The Multi-Genre Natural Language Inference corpus tests the language understanding of a model by seeing whether it can recognize textual entailment. (Williams et al., NAACL 2018)
- MRPC: The Microsoft Research Paraphrase Corpus contains pairs of sentences that are either paraphrases/semantically equivalent to each other or unrelated. (Dolan & Brockett, IJCNLP 2005)
- QQP2: Similarly to MRPC, the Quora Question Pairs corpus tests the model's ability to determine whether a pair of questions is semantically similar. These questions are sourced from Quora.
- MultiRC: The Multi-Sentence Reading Comprehension corpus is a QA task that evaluates the model's ability to choose the correct answer from a list of answers given a question and a context paragraph. In this version the data is converted to binary classification, judging whether a candidate answer to a question and context pair is correct. (Khashabi et al., NAACL 2018)
- RTE: The Recognizing Textual Entailment corpus, similarly to MNLI, tests the model's ability to recognize textual entailment. (Dagan et al., Springer 2006; Bar et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., TAC 2009)
- WSC: The Winograd Schema Challenge tests the model's ability to do coreference resolution on sentences with a pronoun and a list of noun phrases found in the sentence. This version converts it to a binary classification over examples consisting of a pronoun and a noun phrase. (Levesque et al., PKRR 2012)
2 https://www.quora.com/profile/Ricky-Riche-2/First-Quora-Dataset-Release-Question-Pairs
Metrics
The metrics used to evaluate the model are the following:
- Validation Data
- Cross-entropy loss on the masked tokens
- Zero-shot
- Accuracy on predicting the correct completion/sentence for BLiMP, BLiMP Supplement, EWoK, Entity Tracking, and WUGs
- Change in R^2 prediction from baseline for Eye Tracking (with no spillover) and Self-paced Reading (1-word spillover)
- Finetuning
- 3 class Accuracy for MNLI
- Binary Accuracy for BoolQ, MultiRC, and WSC
- F1-score for MRPC and QQP
The metrics were chosen following the recommendations of the papers the tasks come from. A sketch of how the change in R^2 can be computed for the reading-time tasks is given below.
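As an illustration of the reading-time metric, the sketch below computes a change in R^2 by comparing a baseline regression of reading times against one that also includes model surprisal. The baseline predictors (word length and log frequency) and the synthetic data are assumptions made for illustration; the official evaluation pipeline defines its own regressors and spillover handling.

```python
# Hedged sketch: change in R^2 when adding surprisal to a baseline regression.
# The baseline predictors and synthetic data are assumptions; the official
# evaluation pipeline may use a different feature set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500
word_length = rng.integers(1, 12, size=n).astype(float)
log_freq = rng.normal(0.0, 1.0, size=n)
surprisal = rng.gamma(2.0, 2.0, size=n)  # stand-in for per-word model surprisal
reading_time = 50 + 5 * word_length - 3 * log_freq + 4 * surprisal + rng.normal(0, 10, n)

baseline_X = np.column_stack([word_length, log_freq])
full_X = np.column_stack([word_length, log_freq, surprisal])

r2_baseline = r2_score(reading_time, LinearRegression().fit(baseline_X, reading_time).predict(baseline_X))
r2_full = r2_score(reading_time, LinearRegression().fit(full_X, reading_time).predict(full_X))
delta_r2 = r2_full - r2_baseline  # the reported "change in R^2"
print(f"baseline R^2 = {r2_baseline:.3f}, full R^2 = {r2_full:.3f}, delta = {delta_r2:.3f}")
```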
Hyperparameters
Hyperparameter | MNLI, RTE, QQP, MRPC | BoolQ, MultiRC | WSC |
---|---|---|---|
Learning Rate | 3×10⁻⁵ | 3×10⁻⁵ | 3×10⁻⁵ |
Batch Size | 32 | 16 | 32 |
Epochs | 10 | 10 | 30 |
Weight decay | 0.01 | 0.01 | 0.01 |
Optimizer | AdamW | AdamW | AdamW |
Scheduler | cosine | cosine | cosine |
Warmup percentage | 6% | 6% | 6% |
Dropout | 0.1 | 0.1 | 0.1 |
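Under the assumption that fine-tuning uses a standard Hugging Face `Trainer` setup, the MNLI/RTE/QQP/MRPC column of the table above maps roughly onto the following `TrainingArguments`; the official BabyLM evaluation pipeline may configure these hyperparameters through its own scripts.

```python
# Hedged sketch: the MNLI/RTE/QQP/MRPC column expressed as TrainingArguments.
# Using the Trainer API here is an assumption, not the authors' confirmed setup.
from transformers import TrainingArguments

mnli_args = TrainingArguments(
    output_dir="finetune-mnli",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.06,
    # Dropout (0.1) is a model-config setting, not a TrainingArguments field.
)
```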
Results
Validation (Loss)
- 3.10
Zero-shot
Task | Metric | Causal Score | MNTP Score |
---|---|---|---|
BLiMP | Acc | 71.66 | 69.07 |
BLiMP Supplement | Acc | 63.21 | 64.33 |
EWoK | Acc | 49.49 | 49.62 |
Eye Tracking | change in R^2 | 9.89 | 9.47 |
Self-paced Reading | change in R^2 | 3.45 | 3.48 |
Entity Tracking | Acc | 33.96 | 39.17 |
WUGs | Acc | 43.00 | 43.00 |
Finetuning
Task | Metric | Score |
---|---|---|
BoolQ | Acc | |
MNLI | Acc | |
MRPC | F1 | |
QQP | F1 | |
MultiRC | Acc | |
RTE | Acc | |
WSC | Acc |
Technical Specifications
Model Architecture and Objective
The model architecture is based on GPT-BERT (Charpentier & Samuel, CoNLL-BabyLM 2024), which in turn builds on the LTG-BERT (Samuel et al., Findings 2023) architecture. We train on two objectives: Masked Next Token Prediction (MNTP) and Causal Language Modeling. During training there are 15 examples with the causal objective for every example with the MNTP objective; a sketch of this mixing is given below.
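To make the objective mixing concrete, here is a minimal sketch of how a data loader could assign one of the two objectives per training example at the reported 15:1 ratio. This only illustrates the ratio; it is not the authors' implementation.

```python
# Hedged sketch of the 15:1 causal-to-MNTP objective mixing; illustration only.
import random

CAUSAL_PER_MASKED = 15  # 93.75% causal vs. 6.25% masked (MNTP)

def assign_objective(rng: random.Random) -> str:
    """Return 'causal' or 'mntp' at the 15:1 ratio reported in the card."""
    return "causal" if rng.random() < CAUSAL_PER_MASKED / (CAUSAL_PER_MASKED + 1) else "mntp"

rng = random.Random(0)
counts = {"causal": 0, "mntp": 0}
for _ in range(10_000):
    counts[assign_objective(rng)] += 1
print(counts)  # roughly 9375 causal vs. 625 mntp
```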
Compute Infrastructure
We use the LUMI supercomputer to train this model.
We acknowledge Norway for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through Sigma2.
The computations were performed on resources provided by Sigma2, the National Infrastructure for High-Performance Computing and Data Storage in Norway.
Hardware
- 8 AMD MI250X GPUs (each is split into two compute units, so they function as 16 GPUs)
Software
PyTorch
Training Time
The model took 40 minutes to train (which equates to 10.67 GPU-hours).
Citation
@misc{charpentier2025babylmturns3papers,
title={BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop},
author={Lucas Charpentier and Leshem Choshen and Ryan Cotterell and Mustafa Omer Gul and Michael Hu and Jaap Jumelet and Tal Linzen and Jing Liu and Aaron Mueller and Candace Ross and Raj Sanjay Shah and Alex Warstadt and Ethan Wilcox and Adina Williams},
year={2025},
eprint={2502.10645},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.10645},
}
Model Card Authors
Lucas Georges Gabriel Charpentier
Bibliography
BERT or GPT: why not both? (Charpentier & Samuel, CoNLL-BabyLM 2024)
Trained on 100 million words and still in shape: BERT meets British National Corpus (Samuel et al., Findings 2023)
GLUE: A multi-task benchmark and analysis platform for natural language understanding (Wang et al., ICLR 2019)
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., NeurIPS 2019)
BLiMP: The Benchmark of Linguistic Minimal Pairs for English (Warstadt et al., TACL 2020)
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora (Warstadt et al., CoNLL-BabyLM 2023)
Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in language models (Ivanova et al., 2024)
Cloze probability, predictability ratings, and computational estimates for 205 English sentences, aligned with existing EEG and reading time data (de Varda et al., BRM 2024)
Entity Tracking in Language Models (Kim & Schuster, ACL 2023)
Derivational Morphology Reveals Analogical Generalization in Large Language Models (Hofmann et al., 2024)
Automatically Constructing a Corpus of Sentential Paraphrases (Dolan & Brockett, IJCNLP 2005)
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (Williams et al., NAACL 2018)
The Winograd Schema Challenge (Levesque et al., PKRR 2012)
The PASCAL Recognising Textual Entailment Challenge (Dagan et al., Springer 2006)
The Second PASCAL Recognising Textual Entailment Challenge (Bar et al., 2006)
The Third PASCAL Recognizing Textual Entailment Challenge (Giampiccolo et al., 2007)
The Fifth PASCAL Recognizing Textual Entailment Challenge (Bentivogli et al., TAC 2009)
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions (Clark et al., NAACL 2019)
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences (Khashabi et al., NAACL 2018)