arxiv:2412.13663

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Published on Dec 18, 2024 · Submitted by jph00 on Dec 19, 2024
#1 Paper of the day

Abstract

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

Community

Paper author Paper submitter

We're very excited about the release of ModernBERT -- it feels like it could be the basis of all kinds of interesting new startups and research projects.

In fact, the stuff mentioned in the paper and blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.

"We remove the Next-Sentence Prediction objective which introduces noticeable overhead for no performance improvement"

But this is only half of the truth and mainly copied from the RoBERTa paper.

The other half: the ALBERT paper (see Table 5) shows an improvement for NSP over no sentence-level objective, not on the SQuAD datasets, but on average. Additionally, their approach of introducing a sentence order prediction (SOP) loss boosts performance on various downstream tasks.
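To make the difference between the two objectives concrete, here is a minimal sketch of how training pairs are commonly built for each. It is a simplification under stated assumptions: documents are already split into segments, the 50/50 sampling follows the usual descriptions of BERT and ALBERT, and the function names (`nsp_pair`, `sop_pair`) and the 1/0 labels are illustrative choices, not taken from either codebase or from ModernBERT.

```python
import random


def _pick_adjacent(docs, rng):
    """Pick a document index d and a position i so that i and i + 1 are both valid."""
    d = rng.randrange(len(docs))
    i = rng.randrange(len(docs[d]) - 1)
    return d, i


def nsp_pair(docs, rng):
    """Next-Sentence Prediction: the negative is a segment from a different document.

    Assumes at least two documents, each with at least two segments.
    """
    d, i = _pick_adjacent(docs, rng)
    first = docs[d][i]
    if rng.random() < 0.5:
        return first, docs[d][i + 1], 1  # 1 = true next segment
    other = rng.choice([k for k in range(len(docs)) if k != d])
    return first, rng.choice(docs[other]), 0  # 0 = segment from another document


def sop_pair(docs, rng):
    """Sentence Order Prediction: the negative is the same adjacent pair, swapped."""
    d, i = _pick_adjacent(docs, rng)
    a, b = docs[d][i], docs[d][i + 1]
    if rng.random() < 0.5:
        return a, b, 1  # 1 = original order
    return b, a, 0  # 0 = swapped order


if __name__ == "__main__":
    toy_docs = [["a1", "a2", "a3"], ["b1", "b2"]]  # hypothetical toy corpus
    rng = random.Random(0)
    print("NSP:", nsp_pair(toy_docs, rng))
    print("SOP:", sop_pair(toy_docs, rng))
```

The NSP negative can often be spotted from a topic change alone, whereas the SOP negative keeps the topic fixed and only changes the order, which is the usual argument for why SOP is the harder and more useful signal.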

I would also be interested in how much hardware was involved in pretraining the base and large models, and how long pretraining took :)

Paper author

Hello,

Everything is included in Table 3 of the paper (Appendix A).
Hope it helps!


Great work, especially for most industry tasks

Thanks for this very welcome modernisation of the 'good old' BERT architecture ;)
However, a big part of the appeal of recent LLM/decoder-only models for a lot of us is their multilingual capability. I would love to see a variant pretrained on more natural languages in place of code, which would keep the training budget the same. The two variants would be complementary, i.e. used for different downstream applications. :)

It would be very interesting to see a training loss curve! Does a 150/300M model really need almost 2T tokens?

Thanks :)
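For a sense of scale behind the question above, the tokens-per-parameter ratio can be worked out directly. This is a minimal sketch that assumes the round figures quoted on this page: 2 trillion training tokens from the abstract, and 150M and 300M parameters as the commenter's approximations rather than the paper's exact counts. The function name is illustrative.

```python
TRAINING_TOKENS = 2e12  # "2 trillion tokens", as stated in the abstract


def tokens_per_parameter(n_params, n_tokens=TRAINING_TOKENS):
    """Return how many training tokens were seen per model parameter."""
    return n_tokens / n_params


if __name__ == "__main__":
    # Parameter counts are the commenter's round numbers, not exact model sizes.
    for name, n_params in [("~150M", 150e6), ("~300M", 300e6)]:
        ratio = tokens_per_parameter(n_params)
        print(f"{name}: {ratio:,.0f} tokens per parameter")
```

Under these assumptions the ratio is on the order of several thousand tokens per parameter for both sizes, which is what makes a training loss curve informative here: it would show whether the loss was still improving late in training.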
