Should We Still Pretrain Encoders with Masked Language Modeling?
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder models have traditionally relied on Masked Language Modeling (MLM) pretraining, recent work suggests that decoder models pretrained with Causal Language Modeling (CLM) can also be effectively repurposed as encoders. In our latest paper, we address the question of whether the benefits of CLM are due to the objective itself or to confounding factors like model size or training scale.
We conducted a controlled study: identical model sizes, same amount of pretraining data, and a broad downstream task suite. We evaluated two realistic scenarios: pretraining from scratch and continued pretraining. This setup led us to train over 30 models and run more than 15,000 finetuning evaluations, totaling 110k GPU hours. Our results show that MLM alone isn’t optimal: starting with CLM can significantly boost downstream performance.
Masked or Causal Language Modeling?
We study the effect of pretraining strategies that combine the CLM and MLM objectives within a fixed step budget. Ranging from MLM-only to CLM-only setups, we evaluate five splits of the budget between the two objectives: 100% MLM, 75% MLM / 25% CLM, 50% / 50%, 25% MLM / 75% CLM, and 100% CLM. Each configuration is trained under fixed compute budgets of 12k, 22k, and 42k steps. As shown in the figure above, hybrid objectives consistently outperform pure MLM across downstream tasks, although the degree of improvement varies with the task and training budget.
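To make the split concrete, here is a minimal sketch (not the released training code) of how a fixed step budget could be divided between the two objectives. It assumes, in line with the idea of starting with CLM, that hybrid runs complete a CLM phase first and then switch to MLM; the function name make_schedule and its arguments are illustrative, not taken from the codebase.

```python
# Minimal sketch, not the authors' training pipeline: divide a fixed step budget
# between an initial CLM phase and a subsequent MLM phase.
# make_schedule, total_steps, and mlm_fraction are illustrative names.

def make_schedule(total_steps: int, mlm_fraction: float) -> list:
    """Return the objective used at each training step: CLM first, then MLM."""
    mlm_steps = round(total_steps * mlm_fraction)
    clm_steps = total_steps - mlm_steps
    return ["clm"] * clm_steps + ["mlm"] * mlm_steps

# The five splits above, expressed as the fraction of steps spent on MLM
# (1.0 = 100% MLM, 0.0 = 100% CLM), shown here for a 22k-step budget:
for mlm_fraction in (1.0, 0.75, 0.5, 0.25, 0.0):
    schedule = make_schedule(total_steps=22_000, mlm_fraction=mlm_fraction)
    print(f"MLM fraction {mlm_fraction:.2f}: "
          f"{schedule.count('clm')} CLM steps, {schedule.count('mlm')} MLM steps")
```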
Continued Pretraining
Motivated by the strong performance of hybrid training strategies across data scales, we explore a continued pretraining (CPT) setup. Specifically, we ask whether it is more effective to adapt a CLM-pretrained model with an MLM objective or to continue MLM training from an MLM-pretrained model. We compare equally sized CLM and MLM models pretrained on the same data and allocate a fixed compute budget for CPT. To keep computational costs manageable, we perform a 22k-step CPT run on the 610M model using a 40% masking ratio.
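For illustration, the snippet below sketches how such an MLM adaptation could be configured with the Hugging Face Transformers data collator at the 40% masking ratio mentioned above. This is not the released CPT pipeline; the "gpt2" checkpoint is only a stand-in for whichever CLM-pretrained model is being adapted.

```python
# Minimal sketch, not the released CPT code: mask 40% of tokens on the fly for
# MLM-style continued pretraining of a CLM-pretrained model.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Stand-in checkpoint: any CLM-pretrained tokenizer/model pair would do here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# CLM tokenizers typically lack a mask token, so one is added (the model's
# embedding matrix would need to be resized accordingly before training).
if tokenizer.mask_token is None:
    tokenizer.add_special_tokens({"mask_token": "<mask>"})

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,             # masked rather than causal language modeling
    mlm_probability=0.4,  # the 40% masking ratio used for CPT
)

# `collator` can then be passed to a Trainer (or a custom loop) so that each
# batch is masked on the fly while the model is optimized with the MLM loss.
```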
As shown above, the MLM-adapted CLM model consistently yields better downstream performance. On token classification (TC), where CLM-only models already perform well, performance is maintained and the advantage over the MLM baseline persists. For question answering (QA) and information retrieval (IR), the gap is effectively closed. On sequence classification (SC), the MLM-adapted CLM model significantly outperforms the MLM-only baseline. This is especially exciting: starting from a state-of-the-art decoder, we could obtain even better encoders, paving the way for cheap and efficient models that require minimal training resources!
Conclusion
Building on this strong empirical evidence, we conclude that encoder models should not be pretrained exclusively with the masked language modeling (MLM) objective. In particular, our experiments demonstrate that adapting CLM-pretrained models with subsequent MLM training consistently outperforms training with MLM alone from scratch, suggesting a nuanced interaction between the two objectives. We hope that releasing our resources will enable future work in this space, for instance extending our findings to Vision-Language Models (VLMs), many of which are decoder-based and could benefit from improved representation learning.
Open Access and Availability
To support research and real-world applications, we are open-sourcing all the artifacts used in this project, including:
📝 Paper: https://arxiv.org/abs/2507.00994
🤖 Models: https://huggingface.co/MLMvsCLM
💻 Training code: https://github.com/Nicolas-BZRD/EuroBERT/tree/MLM_vs_CLM
📊 Evaluation code: https://github.com/hgissbkh/EncodEval/tree/MLM_vs_CLM
Contributors
We thank the entire team without whom this would not have been possible: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, and Pierre Colombo.
We also thank our institutional and industrial partners: Artefact, Diabolocom, Illuin Technology, Unbabel, MICS – CentraleSupélec – Université Paris-Saclay, Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit), and Instituto de Telecomunicações.
We gratefully acknowledge the support of the French government through the France 2030 program as part of the ArGiMi project, as well as the CINES infrastructure, the DataIA Institute, and Utter, whose contributions facilitated the completion of this work.