ModernVBERT: Towards Smaller Visual Document Retrievers

Community Article · Published October 3, 2025


TL;DR

This blog post introduces ModernVBERT, a 250M vision-language retriever, state-of-the-art for its size category on document retrieval. ModernVBERT is small, fast, and fully open-source. All model checkpoints, training datasets and training recipes are released under MIT license.

[Figure: ModernVBERT on the retrieval performance vs. model size Pareto frontier]

What’s The Matter With Current Retrievers?

Most recent visual encoders are built by adapting large vision-language models (VLMs). These models dominate retrieval benchmarks, surpassing contrastively pretrained dual encoders by leveraging their strength in complex multimodal reasoning and their ability to transfer knowledge across diverse visual tasks.

Yet, they inherit a fundamental limitation from their generative roots: causal attention. Designed for text generation, VLMs process tokens only in a forward direction, predicting the next token from past ones. While effective for generation, this hampers embedding quality for retrieval. By ignoring future context, they might fail to capture relationships that retrieval requires.
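
To make this concrete, here is a minimal PyTorch sketch (illustrative, not ModernVBERT's actual implementation) contrasting the two schemes: the causal mask hides future tokens, while the bidirectional variant lets every token attend to the whole sequence.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal: bool):
    # q, k, v: (seq_len, dim)
    scores = q @ k.T / q.shape[-1] ** 0.5
    if causal:
        # Decoder-style: token i only sees tokens j <= i, so future
        # context never flows into an embedding.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    # Encoder-style (causal=False): every token attends to the full sequence.
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(5, 16)
uni = attention(x, x, x, causal=True)    # what a generative VLM computes
bi  = attention(x, x, x, causal=False)   # what a bidirectional encoder computes
```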

[Figure: causal vs. bidirectional attention]

That observation led us to a crucial question: What if we built a visual retriever from scratch, with bidirectional attention at its core?

Bidirectional or Not Bidirectional?

[Figure: causal and bidirectional attention masks evaluated on ViDoRe]

So, does bidirectional attention really make a difference for retrieval? To find out, we set up a controlled experiment where the only change across models was the attention scheme. The results were surprising: in single-vector retrieval, the gap was negligible. But once we moved to multi-vector settings, the benefits of bidirectional attention jumped out. On standard document retrieval benchmarks with Late Interaction matching, bidirectional encoders scored a +10.6 nDCG@5 boost over causal decoders.
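
Late Interaction keeps one embedding per token and scores a query against a document by summing, over query tokens, the maximum similarity to any document token (the MaxSim operator popularized by ColBERT). A minimal sketch, with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim scoring: query_emb (n_q, dim), doc_emb (n_d, dim), both L2-normalized."""
    sim = query_emb @ doc_emb.T          # cosine similarity of every query/doc token pair
    return sim.max(dim=1).values.sum()   # best doc token per query token, summed

# Toy embeddings standing in for real model outputs.
q = F.normalize(torch.randn(8, 128), dim=-1)     # 8 query tokens
d = F.normalize(torch.randn(600, 128), dim=-1)   # 600 document patch/token embeddings
print(late_interaction_score(q, d))
```

Since every document token embedding can absorb context from both directions under bidirectional attention, each query token has richer candidates to match against, which is one intuition for the gap we observed.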

Takeaway: Bidirectional attention is crucial to unlock the full power of Late Interaction.

Training Tricks That Make the Difference

Bidirectional attention is only half the story. In practice, other training choices can have a major impact on downstream performance:

High-Resolution Inputs for Document Tasks

[Figure: impact of image resolution on document retrieval performance]

When it comes to documents, detail matters. Training with higher image resolutions (up to 2048px) consistently improved performance on document retrieval tasks. Additional gains came from increasing the resolution at the end of the modality alignment phase (HR cooldown), yielding a +49.2% relative improvement over training at the visual encoder's native resolution. This allowed the model to better capture fine-grained text and layout signals.
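
On the preprocessing side, here is a rough sketch of capping a page image's longest side at 2048px before encoding; the helper name and file path are hypothetical:

```python
from PIL import Image

def resize_longest_side(image: Image.Image, max_side: int = 2048) -> Image.Image:
    # Downscale only if needed, preserving aspect ratio so dense text
    # in document pages stays legible to the vision encoder.
    scale = max_side / max(image.size)
    if scale < 1.0:
        new_size = (round(image.width * scale), round(image.height * scale))
        image = image.resize(new_size, Image.Resampling.LANCZOS)
    return image

page = resize_longest_side(Image.open("page.png"))  # "page.png" is a placeholder path
```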

Smart Data Mixing Pays Off

Not enough “document + query” pairs? No problem. By augmenting with text-only pairs, we gained +1.7 nDCG@5 on document retrieval benchmarks. The model was able to transfer its understanding across modalities, and since text-only training is cheap and scalable, it boosted performance without slowing training.
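
A sketch of what such mixing can look like in a training loop; the 50/50 ratio and all names are illustrative assumptions, not the exact recipe from the paper:

```python
import random

def mixed_batches(doc_query_pairs, text_pairs, batch_size=32, text_ratio=0.5):
    # Each batch mixes cheap text-only pairs with image-document pairs.
    # `text_ratio` is an illustrative assumption, not the paper's exact mix.
    n_text = int(batch_size * text_ratio)
    while True:
        batch = random.sample(text_pairs, n_text)
        batch += random.sample(doc_query_pairs, batch_size - n_text)
        random.shuffle(batch)
        yield batch

# Toy data standing in for the real training sets.
image_pairs = [(f"page_{i}.png", f"query {i}") for i in range(1000)]
text_only   = [(f"passage {i}", f"query {i}") for i in range(1000)]
first_batch = next(mixed_batches(image_pairs, text_only))
```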

Takeaway: It isn’t just about architecture, it’s also about thoughtful training choices. Small tweaks, like resolution and data mixing, can compound into important gains.

Here Comes ModernVBERT

With all that in mind, we scaled up the recipe to train ModernVBERT, a 250M parameter visual encoder built on the Ettin-150M text encoder (ModernBERT architecture).

[Figure: ModernVBERT results on visual document retrieval benchmarks]

The outcome? When finetuned for visual document retrieval, ModernVBERT matches the performance of models nearly 10x larger on visual document benchmarks, while offering markedly faster CPU inference than models of similar quality.

Why Do Small Retrievers Matter?

While large VLM-based retrievers still achieve state-of-the-art results, they are often impractical in real-world scenarios where expensive compute is not available at inference time. Thanks to its compact design, ModernVBERT enables fast CPU query encoding, running up to 86% faster than models with comparable performance. That efficiency unlocks new use cases for visual retrieval and closes the size gap with popular text retrievers, which in practice often run on CPUs.
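
For a sense of what CPU-side query encoding looks like, here is a hedged sketch using the transformers API; the checkpoint id and output handling are assumptions, so check the released model cards for the actual loading code:

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint id for illustration; see the released model cards
# for the real identifier and loading code.
MODEL_ID = "ModernVBERT/colmodernvbert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()  # 250M params fits easily on CPU

queries = ["What was the revenue reported in Q3?"] * 32

with torch.inference_mode():
    start = time.perf_counter()
    inputs = tokenizer(queries, padding=True, return_tensors="pt")
    token_embs = model(**inputs).last_hidden_state  # one vector per token (multi-vector)
    elapsed = time.perf_counter() - start
print(f"{elapsed / len(queries) * 1000:.1f} ms per query on CPU")
```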

Links

  • Paper: https://arxiv.org/abs/2510.01149

Authors

  • Paul Teiletche (X: @pteiletche)
  • Quentin Macé (X: @MaceQuent1)
  • Max Conti (X: @mlpc123)
  • Antonio Loison (X: @antonio_loison)
  • Manuel Faysse (X: @ManuelFaysse)

Citation

You can cite this work as follows:

@misc{teiletche2025modernvbertsmallervisualdocument,
      title={ModernVBERT: Towards Smaller Visual Document Retrievers}, 
      author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
      year={2025},
      eprint={2510.01149},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2510.01149}, 
}

Acknowledgments

This work was carried out within the framework of the LIAGORA "LabCom", a joint laboratory supported by the French National Research Agency (ANR) and established between ILLUIN Technology and the MICS laboratory of CentraleSupélec. This work was performed using HPC resources from IDRIS under grant AD011016393. We warmly thank Hippolyte Gisserot-Boukhlef and Nicolas Boizard for sharing the LM checkpoints used in the controlled experiments, Antoine Chaffin for his feedback on the modality alignment codebase and insights into Ettin's modeling, and the SmolVLM team for their valuable input and help in gathering the modality alignment dataset.

Community

Fantastic work!
What are the steps required to export an ONNX model for this? Same as ColPali: create merged model -> convert to HF weights -> export?
Would love to help make this happen with some guidance. It would greatly increase adoption imho.

Great work. Fully open source? Looking forward to checking out the code. Ty.


Yes, fully open source: models, datasets, and (soon) the codebase. The codebase will mostly contain the training configs; the trainers are already available, since they build on existing open-source projects.

Great work! Any plans for a multilingual version?

Article author:

Thanks! Unfortunately not for the moment, but let's see if someone can build one soon!

Great job!
