Should We Still Pretrain Encoders with Masked Language Modeling?

Community Article Published July 2, 2025

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder models have traditionally relied on Masked Language Modeling (MLM) pretraining, recent work suggests that decoder models pretrained with Causal Language Modeling (CLM) can also be effectively repurposed as encoders. In our latest paper, we address the question of whether the benefits of CLM are due to the objective itself or to confounding factors like model size or training scale.


We conducted a controlled study: identical model sizes, same amount of pretraining data, and a broad downstream task suite. We evaluated two realistic scenarios: pretraining from scratch and continued pretraining. This setup led us to train over 30 models and run more than 15,000 finetuning evaluations, totaling 110k GPU hours. Our results show that MLM alone isn’t optimal: starting with CLM can significantly boost downstream performance.

Masked or Causal Language Modeling?

(Figure: downstream performance of pure and hybrid CLM/MLM pretraining objectives across training budgets.)

We study the effect of pretraining strategies that combine the CLM and MLM objectives, starting training with CLM and then transitioning to MLM. We evaluate five splits of the step budget between the two objectives: 100% MLM, 75%-25%, 50%-50%, 25%-75%, and 100% CLM. Each configuration is trained under fixed compute budgets of 12k, 22k, and 42k steps. As shown in the figure above, hybrid objectives consistently outperform pure MLM across downstream tasks, although the degree of improvement varies with the task and training budget.
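
To make the setup concrete, here is a minimal, self-contained sketch of such a two-phase schedule: the first part of a fixed step budget is spent on CLM, the remainder on MLM. Everything in it (the toy model, the particular 75%-25% split, the masking constants) is illustrative and is not the released training code; the actual implementation lives in the repositories linked at the end of this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE, MASK_ID = 1000, 4          # toy values, not the real tokenizer
TOTAL_STEPS = 22_000                   # one of the fixed compute budgets
CLM_FRACTION, MASK_RATIO = 0.75, 0.40  # e.g. a 75%-25% CLM-to-MLM split


class ToyEncoder(nn.Module):
    """Tiny stand-in for the real model; attention is causal or bidirectional per call."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, input_ids: torch.Tensor, causal: bool) -> torch.Tensor:
        seq_len = input_ids.size(1)
        # Additive attention mask: -inf above the diagonal blocks attention to future tokens.
        mask = torch.full((seq_len, seq_len), float("-inf")).triu(1) if causal else None
        return self.lm_head(self.encoder(self.embed(input_ids), mask=mask))


def objective_for_step(step: int) -> str:
    """First CLM_FRACTION of the step budget trains with CLM, the rest with MLM."""
    return "clm" if step < int(TOTAL_STEPS * CLM_FRACTION) else "mlm"


def training_loss(model: ToyEncoder, input_ids: torch.Tensor, step: int) -> torch.Tensor:
    if objective_for_step(step) == "clm":
        # Causal LM: predict token t+1 from a causally masked forward pass.
        logits = model(input_ids, causal=True)
        return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB_SIZE),
                               input_ids[:, 1:].reshape(-1))
    # Masked LM: corrupt a fraction of tokens, predict only those positions.
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < MASK_RATIO
    labels[~masked] = -100  # positions ignored by the loss
    logits = model(input_ids.masked_fill(masked, MASK_ID), causal=False)
    return F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1),
                           ignore_index=-100)


if __name__ == "__main__":
    model, batch = ToyEncoder(), torch.randint(5, VOCAB_SIZE, (8, 32))
    print(training_loss(model, batch, step=100))     # falls in the CLM phase
    print(training_loss(model, batch, step=20_000))  # falls in the MLM phase
```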

Continued Pretraining

Motivated by the strong performance of hybrid training strategies across data scales, we explore a continued pretraining (CPT) setup. Specifically, we ask whether it is more effective to adapt a CLM-pretrained model with an MLM objective or to continue MLM training from an MLM-pretrained model. We compare equally sized CLM and MLM models pretrained on the same data and allocate a fixed compute budget for CPT. To keep computational costs manageable, we perform a 22k-step CPT on the 610M model with a 40% masking ratio.
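
As a rough sketch of what this MLM-adaptation phase could look like with off-the-shelf tooling, the snippet below wires a 40% masking ratio and a 22k-step budget into Hugging Face's Trainer. The checkpoint name and corpus are placeholders, the remaining hyperparameters are illustrative, and turning a causal checkpoint into a bidirectional encoder also requires architecture-specific changes (attention masking, an MLM head) that are not shown here.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

CHECKPOINT = "my-org/clm-pretrained-610m"  # hypothetical CLM checkpoint to adapt

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)  # must define a mask token
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT)

# Illustrative corpus; the paper's actual pretraining data is not reproduced here.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

# Select 40% of tokens for the MLM objective, matching the CPT setup above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.40
)

args = TrainingArguments(
    output_dir="mlm-adapted-clm",
    max_steps=22_000,               # fixed CPT budget
    per_device_train_batch_size=8,  # illustrative
    learning_rate=1e-4,             # illustrative
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=collator,
).train()
```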

(Figure: downstream performance of the MLM-adapted CLM model compared to continued MLM training.)

As shown above, the MLM-adapted CLM model consistently yields better downstream performance. On token classification (TC), where CLM-only models already perform well, performance is maintained and the gap to MLM persists. For question answering (QA) and information retrieval (IR), the gap is effectively closed. On sentence classification (SC), the MLM-adapted CLM model significantly outperforms the MLM-only baseline. This is especially exciting because it shows that, by starting from a state-of-the-art decoder, we could obtain even better encoders, paving the way for cheap and efficient models that require minimal training resources!

Conclusion

Building on this strong empirical evidence, we conclude that encoder models should not be pretrained exclusively with masked language modeling (MLM) objectives. In particular, our experiments show that adapting CLM-pretrained models with subsequent MLM training consistently outperforms MLM-only training from scratch, suggesting a nuanced interaction between the two objectives. We hope our resource release will enable future work in this space, extending our findings to Vision-Language Models (VLMs), many of which are decoder-based, and potentially enhancing their representation learning capabilities.

Open Access and Availability

To support research and real-world applications, we are open-sourcing all the artifacts used in this project, including:

📝 Paper: https://arxiv.org/abs/2507.00994

🤖 Models: https://huggingface.co/MLMvsCLM (a minimal loading example follows this list)

💻 Training code: https://github.com/Nicolas-BZRD/EuroBERT/tree/MLM_vs_CLM

📊 Evaluation code: https://github.com/hgissbkh/EncodEval/tree/MLM_vs_CLM
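
If you just want to try one of the released checkpoints as a text encoder, something along these lines should work. The repository name below is a placeholder (browse the hub organization above for the actual model IDs), and depending on the architecture you may need to pass trust_remote_code=True.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "MLMvsCLM/model-name"  # placeholder: pick an actual checkpoint from the hub page

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

inputs = tokenizer("Should we still pretrain encoders with MLM?", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Simple mean pooling over tokens to get one vector per sentence.
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)
```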

Contributors

We thank the entire team without whom this would not have been possible: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, and Pierre Colombo.

We also thank our institutional and industrial partners: Artefact, Diabolocom, Illuin Technology, Unbabel, MICS – CentraleSupélec – Université Paris-Saclay, Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit), and Instituto de Telecomunicações.

We highlight the support of the French government through the France 2030 program as part of the ArGiMi project, the CINES infrastructure, DataIA Institute, and Utter, whose contributions facilitated the completion of this work.

Community

See my comment here:

https://huggingface.co/papers/2507.00994

For token classification:

The problem is that the main assumption "CLM is better than MLM" for token classification is only true when the EuroBERT architecture is used.

EuroBERT is super bad for token classification; XLM-R or DeBERTa are much better architectures for token classification, as can be seen in Table 1 of the EuroBERT paper (https://arxiv.org/abs/2503.05500).

So yes, for token classification we should still pretrain encoders with MLM, as long as they are not {Neo,Modern,Euro}BERT based :)

Article author

Hello,

Thanks for your question!

You're absolutely right: the EuroBERT architecture isn't ideal for token classification, largely due to its tokenizer, as we explain in our paper: https://arxiv.org/abs/2503.05500.
However, keeping the architecture fixed, even if it's not optimal for every task, is a key part of our experimental design, as it allows us to isolate the impact of the training objective. Under this controlled setup, we show that CLM outperforms MLM for token classification, and we believe this finding is generalizable.
As for the choice of architecture, alternatives like RoBERTa or DeBERTa clearly underperform on retrieval, and EuroBERT emerged as a reasonable overall compromise.

Cheers!
