---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: peft
datasets:
- nomic-ai/colpali-queries-mined-20250321-by-source
language:
- en
- it
- fr
- de
- es
pipeline_tag: visual-document-retrieval
tags:
- vidore
- colpali
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
---
# Nomic Embed Multimodal 3B: State-of-the-Art Visual Document Retrieval
`nomic-embed-multimodal-3b` is a dense state-of-the-art multimodal embedding model that excels at visual document retrieval tasks:
- **High Performance**: Achieves 58.8 NDCG@5 on ViDoRe v2, outperforming all other similarly sized dense multimodal embedding models.
- **Unified Text-Image Encoding**: Directly encodes interleaved text and images without complex preprocessing.
- **Advanced Architecture**: 3B-parameter multimodal embedding model fine-tuned from Qwen2.5-VL 3B Instruct.
- **Open Weights**: Model weights are available for research use.
## Performance
NDCG@5 on ViDoRe v2, reported as the overall average and per task (higher is better):
| Model | Avg. | ESG Restaurant Human | Econ. Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro |
|-------|------|----------------------|-------------------|------------|---------|----------------------|----------------------------|---------------|-----|------------|
| [ColNomic Embed Multimodal 7B](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 |
| [ColNomic Embed Multimodal 3B](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-3b) | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 |
| T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 |
| [Nomic Embed Multimodal 7B](https://huggingface.co/nomic-ai/nomic-embed-multimodal-7b) | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 |
| GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 |
| **Nomic Embed Multimodal 3B** | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 |
| Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 |
| Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 |
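For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@5 can be computed for a single query. It assumes linear gain and that every relevant page appears in the ranked list; the benchmark's own evaluation code may differ in these details, and the relevance labels below are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance labels of the pages a retriever returned, in ranked order (illustrative).
print(ndcg_at_k([0, 1, 0, 0, 1], k=5))
```

A score of 1.0 means the relevant pages were ranked ahead of all irrelevant ones; the table values appear to be this quantity scaled to 0-100 and averaged over queries.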
## Getting Started
To use `nomic-embed-multimodal-3b`, install `colpali` from source:
```bash
pip install git+https://github.com/illuin-tech/colpali.git
```
```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import BiQwen2_5, BiQwen2_5_Processor

model_name = "nomic-ai/nomic-embed-multimodal-3b"

model = BiQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = BiQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score(list(torch.unbind(query_embeddings)), list(torch.unbind(image_embeddings)))
```
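To turn the scores into a ranking, one option is to sort each query's row. The sketch below assumes `scores` is a 2-D tensor with one row per query and one column per image, so that `scores.tolist()` yields nested lists. The helper `top_k_pages` and the numbers are hypothetical and only illustrate the idea.

```python
def top_k_pages(score_rows, k=1):
    """For each query row, return the k best (page_index, score) pairs, best first."""
    results = []
    for row in score_rows:
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:k])
    return results


# Stand-in for `scores.tolist()`: 2 queries x 3 pages, values are made up.
example_scores = [
    [0.12, 0.87, 0.40],
    [0.55, 0.10, 0.61],
]
best = top_k_pages(example_scores, k=2)
print(best)
```

With real tensors, `torch.topk(scores, k, dim=1)` gives the same ranking without leaving the GPU.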
## Model Architecture
- **Total Parameters**: 3B
- **Training Approach**: Fine-tuned from Qwen2.5-VL 3B Instruct
- **Architecture Type**: Vision-Language Model with unified text and image input processing
- **Key Innovations**:
  - Same-source sampling to create harder in-batch negatives
  - Hard negative mining with positive-aware techniques
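Same-source sampling can be sketched as a batching step that never mixes dataset sources within a batch, so the in-batch negatives come from similar-looking documents. This is an illustrative sketch, not the actual training code: the `source` field name, the helper `same_source_batches`, and the choice to drop incomplete batches are all assumptions.

```python
import random
from collections import defaultdict


def same_source_batches(examples, batch_size, seed=0):
    """Return a shuffled list of batches whose examples all share one `source` value."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for example in examples:
        by_source[example["source"]].append(example)

    batches = []
    for items in by_source.values():
        rng.shuffle(items)
        # Drop the ragged tail so every batch has exactly batch_size examples.
        for start in range(0, len(items) - batch_size + 1, batch_size):
            batches.append(items[start:start + batch_size])
    rng.shuffle(batches)  # Mix sources across training steps, not within a batch.
    return batches


# Hypothetical training examples tagged with the dataset they came from.
examples = [{"id": i, "source": "a" if i < 5 else "b"} for i in range(9)]
for batch in same_source_batches(examples, batch_size=2):
    print([(e["id"], e["source"]) for e in batch])
```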
## Integration with RAG Workflows
Nomic Embed Multimodal 3B seamlessly integrates with Retrieval Augmented Generation (RAG) workflows:
1. **Direct Document Embedding**: Skip OCR and complex processing by directly embedding document page images
2. **Faster Processing**: Eliminate preprocessing steps for quicker indexing
3. **More Complete Information**: Capture both textual and visual cues in a single embedding
4. **Simple Implementation**: Use the same API for both text and images
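The indexing and retrieval steps of such a workflow can be sketched with a toy in-memory index that stores one dense vector per page. The `PageIndex` class, the page identifiers, and the three-dimensional vectors are hypothetical placeholders; in practice the vectors would come from the model, and cosine similarity is assumed here as the comparison function.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


class PageIndex:
    """Toy in-memory index: one dense vector per document page."""

    def __init__(self):
        self._ids = []
        self._vectors = []

    def add(self, page_id, embedding):
        self._ids.append(page_id)
        self._vectors.append(_normalize(embedding))

    def search(self, query_embedding, k=3):
        """Return the k most similar (page_id, cosine_similarity) pairs, best first."""
        query = _normalize(query_embedding)
        scored = [
            (page_id, sum(q * v for q, v in zip(query, vector)))
            for page_id, vector in zip(self._ids, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


index = PageIndex()
index.add("report.pdf#p1", [0.9, 0.1, 0.0])  # placeholder page embedding
index.add("report.pdf#p2", [0.0, 1.0, 0.2])  # placeholder page embedding
hits = index.search([1.0, 0.0, 0.0], k=1)    # placeholder query embedding
print(hits)
```

The retrieved page images would then be passed to a generator model as context; a vector database would replace the lists for anything beyond a small collection.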
## Recommended Use Cases
The model excels at handling real-world document retrieval scenarios that challenge traditional text-only systems:
- **Research Papers**: Capture equations, diagrams, and tables
- **Technical Documentation**: Encode code blocks, flowcharts, and screenshots
- **Product Catalogs**: Represent images, specifications, and pricing tables
- **Financial Reports**: Embed charts, graphs, and numerical data
- **Visually Rich Content**: Where layout and visual information are important
- **Multilingual Documents**: Where visual context provides important cues
## Training Details
Nomic Embed Multimodal 3B was developed through several key innovations:
1. **Sampling From the Same Source**: Forcing sampling from the same dataset source creates harder in-batch negatives, preventing the model from learning dataset artifacts.
2. **Hard Negative Mining**: Using an initial model to retrieve top-k nearest neighbors for each query, then incorporating these hard negatives into training.
3. **Positive-aware Hard Negative Mining**: Reducing false negatives using techniques introduced in NV-Retriever.
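Steps 2 and 3 can be illustrated with a small filter over the candidates returned by the initial retriever. The sketch assumes one common positive-aware rule: discard any candidate that scores above a fixed fraction of the positive's score, since it is likely a false negative. The function name, the `max_ratio=0.95` value, and the scores are illustrative assumptions, not the settings used to train this model.

```python
def mine_hard_negatives(candidates, positive_score, num_negatives=2, max_ratio=0.95):
    """Pick the highest-scoring candidates that score below max_ratio * positive_score.

    candidates: (doc_id, score) pairs from the initial retriever, positive excluded.
    """
    threshold = max_ratio * positive_score
    kept = [(doc_id, score) for doc_id, score in candidates if score < threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in kept[:num_negatives]]


# Hypothetical top-k neighbours for one query whose labelled positive scored 0.90.
candidates = [("d1", 0.91), ("d2", 0.80), ("d3", 0.62), ("d4", 0.30)]
negatives = mine_hard_negatives(candidates, positive_score=0.90)
print(negatives)
```

Here `d1` is skipped because it scores higher than the labelled positive and is probably an unlabelled relevant page, while `d2` and `d3` are kept as hard negatives.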
## Limitations
- Performance may vary when processing documents with unconventional layouts or unusual visual elements
- While it handles multiple languages, performance is strongest on English content
- Processing very large or complex documents may require dividing them into smaller chunks
- Performance on documents with handwriting or heavily stylized fonts may be reduced
## Join the Nomic Community
- Nomic Embed Ecosystem: [https://www.nomic.ai/embed](https://www.nomic.ai/embed)
- Website: [https://nomic.ai](https://nomic.ai)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
## Citation
If you find this model useful in your research or applications, please consider citing:
```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
title={ColPali: Efficient Document Retrieval with Vision Language Models},
author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
year={2024},
eprint={2407.01449},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2407.01449},
}
@misc{ma2024unifyingmultimodalretrievaldocument,
title={Unifying Multimodal Retrieval via Document Screenshot Embedding},
author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
year={2024},
eprint={2406.11251},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2406.11251},
}
@misc{nomicembedmultimodal2025,
title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
author={Nomic Team},
year={2025},
publisher={Nomic AI},
url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}
```