---
library_name: transformers
license: apache-2.0
pipeline_tag: visual-document-retrieval
base_model:
- openai/clip-vit-large-patch14
datasets:
- aimagelab/ReT-M2KR
---

# Model Card for ReT-CLIP-ViT-L-14

ReT is an approach to multimodal document retrieval that supports multimodal queries and multimodal documents. Unlike existing methods that use only the features from the final layer of vision-and-language backbones, ReT employs a Transformer-based recurrent cell to leverage multi-level representations from different layers of both the visual and the textual backbone. The recurrent cell features sigmoidal gates, inspired by the LSTM design, that selectively control the flow of information between layers and modalities. ReT processes queries and documents independently, producing for each a set of latent tokens that is used for fine-grained late-interaction similarity computation.

ReT is designed to process images and text in both queries and documents. To this end, it has been trained and evaluated on a custom version of the challenging [M2KR](https://arxiv.org/abs/2402.08327) benchmark, with the following modifications:

- MSMARCO has been excluded, as it does not contain images.
- The documents from OVEN, InfoSeek, E-VQA, and OKVQA have been enriched with images.

### Model Sources

- **Repository:** https://github.com/aimagelab/ReT
- **Paper:** [Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval](https://www.arxiv.org/abs/2503.01980) (CVPR 2025)

### Use with Transformers

Follow the instructions in the [repository](https://github.com/aimagelab/ReT) to install the required environment. The example imports `RetrieverModel` and `RetModel` from the repository's `src` package, so that package must be importable from where you run the code.

```python
from src.models import RetrieverModel, RetModel
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
retriever = RetrieverModel.from_pretrained('aimagelab/ReT-CLIP-ViT-L-14', device_map=device)

# QUERY
ret: RetModel = retriever.get_query_model()
ret.init_tokenizer_and_image_processor()
q_txt = "Retrieve documents that provide an answer to the question alongside the image: What is the content of the image?"
q_img = 'assets/model.png'

ret_feats = ret.get_ret_features([[q_txt, q_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])

# PASSAGE
ret: RetModel = retriever.get_passage_model()
ret.init_tokenizer_and_image_processor()

p_txt = """The image shows a diagram of what appears to be a neural network architecture using a fine-grained loss approach for multimodal learning.
The architecture has two parallel processing streams labeled "ReTQ" (left side, in purple) and "ReTD" (right side, in blue).
Each side has: ..."""
p_img = ''

ret_feats = ret.get_ret_features([[p_txt, p_img]])
print(ret_feats.shape)  # torch.Size([1, 32, 128])
```

## Citation

**BibTeX:**

```
@inproceedings{caffagni2025recurrence,
  title={{Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval}},
  author={Caffagni, Davide and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
```
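
## Late-interaction scoring sketch

The usage example stops at feature extraction: each query and each passage becomes a set of latent tokens (shape `[1, 32, 128]` in the printed output). The description above says these sets feed a late-interaction similarity, so the sketch below illustrates one common form of it, ColBERT-style MaxSim: every query token is matched to its best document token by dot product, and the matches are summed. This is an assumption for illustration, not the scoring code from the ReT repository. The function names `late_interaction_score` and `rank_documents` are hypothetical, and plain Python lists with tiny 2-dimensional tokens stand in for the real tensors.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(
    query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]
) -> float:
    """Assumed MaxSim scoring: for each query token keep the highest dot
    product over all document tokens, then sum over the query tokens."""
    if not query_tokens or not doc_tokens:
        raise ValueError("both token sets must be non-empty")
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_documents(
    query_tokens: Sequence[Vector], docs: Sequence[Sequence[Vector]]
) -> List[int]:
    """Return document indices ordered from highest to lowest score."""
    scores = [late_interaction_score(query_tokens, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for latent tokens (2 tokens of dimension 2 each).
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc_a = [[1.0, 0.0], [0.0, 1.0]]   # has a good match for both query tokens
    doc_b = [[1.0, 0.0], [-1.0, 0.0]]  # matches only the first query token

    print(late_interaction_score(query, doc_a))   # 2.0
    print(late_interaction_score(query, doc_b))   # 1.0
    print(rank_documents(query, [doc_b, doc_a]))  # [1, 0]
```

With real features, the same computation is typically done in batch on tensors (a token-by-token similarity matrix, a max over the document axis, then a sum over the query axis). Refer to the repository for the scoring actually used in training and evaluation.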