Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Abstract
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.
Community
We're very excited about the release of ModernBERT -- it feels like it could be the basis of all kinds of interesting new startups and research projects.
In fact, the stuff mentioned in the paper and blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.
"We remove the Next-Sentence Prediction objective which introduces noticeable overhead for no performance improvement"
But this is only half the truth, and it is mainly copied from the RoBERTa paper.
The other half: the ALBERT paper (see Table 5) shows an improvement (NSP over None), not on the SQuAD datasets, but on average. Additionally, their approach of introducing a sentence order prediction (SOP) loss boosts performance on various downstream tasks.
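For readers who haven't compared the two objectives, here is a minimal sketch of how the training pairs differ. It only builds the examples and labels; it is not code from the BERT, ALBERT, or ModernBERT releases. The function names and toy segments are hypothetical, and the 50/50 positive/negative split is an assumption made for illustration.

```python
import random


def make_nsp_example(doc, other_doc, rng):
    """NSP-style pair: segment B is either the true next segment (label 1)
    or a segment drawn from a different document (label 0)."""
    i = rng.randrange(len(doc) - 1)
    seg_a = doc[i]
    if rng.random() < 0.5:  # assumed 50/50 split
        return seg_a, doc[i + 1], 1
    return seg_a, rng.choice(other_doc), 0


def make_sop_example(doc, rng):
    """SOP-style pair: two consecutive segments from the SAME document,
    either in their original order (label 1) or swapped (label 0)."""
    i = rng.randrange(len(doc) - 1)
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:  # assumed 50/50 split
        return first, second, 1
    return second, first, 0


if __name__ == "__main__":
    rng = random.Random(0)
    doc = ["seg 0", "seg 1", "seg 2", "seg 3"]   # toy document
    other = ["other 0", "other 1"]               # toy unrelated document
    print(make_nsp_example(doc, other, rng))
    print(make_sop_example(doc, rng))
```

The point of the comparison: an NSP negative comes from another document, so topic cues alone can give the label away, whereas an SOP negative uses the same two segments and the model has to judge their order.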
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Two are better than one: Context window extension with multi-grained self-injection (2024)
- Why Does the Effective Context Length of LLMs Fall Short? (2024)
- Are Decoder-Only Large Language Models the Silver Bullet for Code Search? (2024)
- Sparse Upcycling: Inference Inefficient Finetuning (2024)
- MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning (2024)
- A Survey of Small Language Models (2024)
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models (2024)
Great work, especially for most industry tasks
Thanks for this very welcome modernisation of the 'good old' BERT architecture ;)
However, a big part of the appeal of recent LLM/decoder-only models for a lot of us is their multilingual capability. Would love to see a variant pretrained on more natural languages (instead of code, to keep the same training budget; the two would be complementary, i.e. used for different downstream applications). :)
It would be very interesting to see a training loss curve! Does a 150/300M model really need almost 2T tokens?
Thanks :)
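To put the question above about 150/300M models and almost 2T tokens into numbers, here is a quick back-of-the-envelope sketch. It uses the round figures from the comment (2T tokens, 150M and 300M parameters), which are the commenter's approximations rather than exact values from the paper. The helper name is hypothetical.

```python
def tokens_per_parameter(train_tokens, n_params):
    """Ratio of training tokens to model parameters."""
    return train_tokens / n_params


TRAIN_TOKENS = 2e12  # "almost 2T" from the comment, rounded up

for n_params in (150e6, 300e6):
    ratio = tokens_per_parameter(TRAIN_TOKENS, n_params)
    print(f"{n_params / 1e6:.0f}M params: ~{ratio:,.0f} tokens per parameter")
```

For rough comparison only, an often-cited rule of thumb for compute-optimal training of decoder models is on the order of 20 tokens per parameter. That heuristic optimises training compute, not inference cost, so it does not by itself say whether the extra tokens help a small encoder; a loss curve would.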