ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance
Abstract
Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several benchmarks, the lack of disclosed training data and the absence of comparisons using a shared dataset make it difficult to determine whether these gains are due to architectural improvements or to differences in training data. In this work, we conduct a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT's primary advantage being faster training and inference speed. However, the newly proposed model still provides meaningful architectural improvements compared to earlier models such as BERT and RoBERTa. Additionally, we observe that high-quality pretraining data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation. These findings highlight the importance of disentangling pretraining data from architectural innovations when evaluating transformer models.
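For readers who want a concrete picture of the controlled setting described in the abstract, here is a minimal, hypothetical sketch of pretraining a randomly initialized ModernBERT with masked language modeling on a shared French corpus using Hugging Face Transformers. The tokenizer checkpoint, corpus file, masking rate, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the controlled comparison: same data and tokenizer, only the
# architecture changes. Checkpoint name, corpus path, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    ModernBertConfig,
    ModernBertForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Shared French tokenizer so only the architecture differs (assumed checkpoint name).
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")

# Randomly initialized ModernBERT encoder; no pretrained weights are loaded.
config = ModernBertConfig(
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    cls_token_id=tokenizer.cls_token_id,
    sep_token_id=tokenizer.sep_token_id,
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
)
model = ModernBertForMaskedLM(config)

# Placeholder for the CamemBERTaV2 pretraining corpus.
corpus = load_dataset("text", data_files={"train": "french_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="modernbert-fr-mlm",
        per_device_train_batch_size=32,
        learning_rate=8e-4,
        max_steps=100_000,
        bf16=True,
    ),
    train_dataset=tokenized["train"],
    # Masked language modeling objective; the 30% masking rate is an assumption.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.3),
)
trainer.train()
```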
Community
Great comparison of ModernBERT vs. other architectures.
NER and QA were not mentioned in the ModernBERT paper, and this paper shows very clearly that ModernBERT has some problems with these kinds of tasks.
Additionally, I would be highly interested if someone could train ModernBERT without RoPE to see whether that is the real reason for the poor performance on NER. (Unfortunately, I don't have multiple GPUs available for this ablation.)
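For anyone who wants to attempt that ablation, here is a rough, hypothetical sketch of one way to neutralize RoPE in the Hugging Face ModernBERT implementation by forcing the rotary embedding to an identity rotation before the model is built. It assumes the sdpa/eager attention path (the flash-attention path uses a separate unpadded rotary class), class names and internals may differ across transformers versions, and it is not taken from the paper.

```python
# Hypothetical RoPE-ablation sketch: make the rotary embedding a no-op so the
# model trains without positional rotation. Assumes the sdpa/eager attention
# path of transformers' ModernBERT implementation.
import torch
from transformers.models.modernbert import modeling_modernbert as mm


class IdentityRotary(mm.ModernBertRotaryEmbedding):
    def forward(self, x, position_ids):
        cos, sin = super().forward(x, position_ids)
        # cos = 1 and sin = 0 turn apply_rotary_pos_emb into an identity map.
        return torch.ones_like(cos), torch.zeros_like(sin)


# Patch before constructing the model so every attention layer picks up the no-op rotary.
mm.ModernBertRotaryEmbedding = IdentityRotary
model = mm.ModernBertForMaskedLM(mm.ModernBertConfig())
```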
Love this. I remember reading about ModernBERT and noticing that the number of tokens it was trained on was very high, especially compared to earlier models.
Thanks!
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- NeoBERT: A Next-Generation BERT (2025)
- Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations (2025)
- One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings (2025)
- Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks (2025)
- From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models (2025)
- DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation (2025)
- Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (2025)