---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: peft
datasets:
- nomic-ai/colpali-queries-mined-20250321-by-source
language:
- en
- it
- fr
- de
- es
pipeline_tag: visual-document-retrieval
tags:
- vidore
- colpali
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
---

# Nomic Embed Multimodal 3B: State-of-the-Art Visual Document Retrieval

`nomic-embed-multimodal-3b` is a state-of-the-art dense multimodal embedding model that excels at visual document retrieval tasks:

- **High Performance**: Achieves 58.8 NDCG@5 on Vidore-v2, outperforming all other similarly sized dense multimodal embedding models
- **Unified Text-Image Encoding**: Directly encodes interleaved text and images without complex preprocessing
- **Advanced Architecture**: 3B-parameter multimodal embedding model
- **Open Weights**: Model weights available for research use

## Performance

All scores are NDCG@5 on the Vidore-v2 benchmark.

| Model | Avg. | ESG Restaurant Human | Econ Macro Multi. | AXA Multi. | MIT Bio | ESG Restaurant Synth. | ESG Restaurant Synth. Multi. | MIT Bio Multi. | AXA | Econ. Macro |
|-------|------|----------------------|-------------------|------------|---------|----------------------|----------------------------|---------------|-----|------------|
| [ColNomic Embed Multimodal 7B](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-7b) | 62.7 | 73.9 | 54.7 | 61.3 | 66.1 | 57.3 | 56.7 | 64.2 | 68.3 | 61.6 |
| [ColNomic Embed Multimodal 3B](https://huggingface.co/nomic-ai/colnomic-embed-multimodal-3b) | 61.2 | 65.8 | 55.4 | 61.0 | 63.5 | 56.6 | 57.2 | 62.5 | 68.8 | 60.2 |
| T-Systems ColQwen2.5-3B | 59.9 | 72.1 | 51.2 | 60.0 | 65.3 | 51.7 | 53.3 | 61.7 | 69.3 | 54.8 |
| [Nomic Embed Multimodal 7B](https://huggingface.co/nomic-ai/nomic-embed-multimodal-7b) | 59.7 | 65.7 | 57.7 | 59.3 | 64.0 | 49.2 | 51.9 | 61.2 | 66.3 | 63.1 |
| GME Qwen2 7B | 59.0 | 65.8 | 56.2 | 55.4 | 64.0 | 54.3 | 56.7 | 55.1 | 60.7 | 62.9 |
| **Nomic Embed Multimodal 3B** | 58.8 | 59.8 | 57.5 | 58.8 | 62.5 | 49.4 | 49.4 | 58.6 | 69.6 | 63.5 |
| Llama Index vdr-2b-multi-v1 | 58.4 | 63.1 | 52.8 | 61.0 | 60.6 | 50.3 | 51.2 | 56.9 | 68.8 | 61.2 |
| Voyage Multimodal 3 | 55.0 | 56.1 | 55.0 | 59.5 | 56.4 | 47.2 | 46.2 | 51.5 | 64.1 | 58.8 |


## Getting Started

To use `nomic-embed-multimodal-3b`, install `colpali-engine` from source:

```bash
pip install git+https://github.com/illuin-tech/colpali.git
```

Then embed queries and document page images and score them against each other:

```python
import torch
from PIL import Image
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import BiQwen2_5, BiQwen2_5_Processor

model_name = "nomic-ai/nomic-embed-multimodal-3b"

model = BiQwen2_5.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = BiQwen2_5_Processor.from_pretrained(model_name)

# Your inputs
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Score every query against every image (rows = queries, columns = images)
scores = processor.score(list(torch.unbind(query_embeddings)), list(torch.unbind(image_embeddings)))
```
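
`processor.score` returns a similarity matrix with one row per query and one column per image, so `scores.argmax(dim=1)` gives each query's best-matching image.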

## Model Architecture

- **Total Parameters**: 3B
- **Training Approach**: Fine-tuned from Qwen2.5-VL 3B Instruct
- **Architecture Type**: Vision-Language Model with unified text and image input processing
- **Key Innovations**:
  - Same-source sampling to create harder in-batch negatives (see the sampler sketch below)
  - Hard negative mining with positive-aware techniques
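
Assuming each training example records its originating dataset, same-source sampling can be approximated with a simple batch sampler. This is an illustrative sketch, not the actual training code; the `source` field name and the full-batch-only policy are assumptions:

```python
import random
from collections import defaultdict

def same_source_batches(examples, batch_size, seed=0):
    """Yield batches whose examples all come from one dataset source,
    so every in-batch negative shares the positive's source.

    `examples` is assumed to be a list of dicts with a "source" key;
    the field name is illustrative, not the actual training schema.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)
    batches = []
    for group in by_source.values():
        rng.shuffle(group)
        # Keep only full batches so every batch stays single-source.
        for i in range(0, len(group) - batch_size + 1, batch_size):
            batches.append(group[i : i + batch_size])
    rng.shuffle(batches)
    yield from batches
```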

## Integration with RAG Workflows

Nomic Embed Multimodal 3B integrates seamlessly with Retrieval Augmented Generation (RAG) workflows (a minimal retrieval sketch follows the list):

1. **Direct Document Embedding**: Skip OCR and complex processing by directly embedding document page images
2. **Faster Processing**: Eliminate preprocessing steps for quicker indexing
3. **More Complete Information**: Capture both textual and visual cues in a single embedding
4. **Simple Implementation**: Use the same API for both text and images
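
As a concrete picture of this loop, the sketch below reuses the `model` and `processor` objects from the Getting Started section; `page_images` stands in for your rendered document pages, and the batch size and `top_k` defaults are illustrative:

```python
import torch

def build_index(page_images, batch_size=4):
    """Embed rendered page images in batches; returns a (num_pages, dim) tensor."""
    embeddings = []
    for i in range(0, len(page_images), batch_size):
        batch = processor.process_images(page_images[i : i + batch_size]).to(model.device)
        with torch.no_grad():
            embeddings.append(model(**batch))
    return torch.cat(embeddings)

def retrieve(query, index, top_k=3):
    """Embed one query and return the indices of its top_k most similar pages."""
    batch = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        query_embedding = model(**batch)
    scores = processor.score(list(torch.unbind(query_embedding)), list(torch.unbind(index)))
    return scores[0].topk(min(top_k, len(index))).indices.tolist()
```

Because pages are embedded once and queries are scored against the cached index, no OCR or layout parsing is needed at query time.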

## Recommended Use Cases

The model excels at handling real-world document retrieval scenarios that challenge traditional text-only systems:

- **Research Papers**: Capture equations, diagrams, and tables
- **Technical Documentation**: Encode code blocks, flowcharts, and screenshots
- **Product Catalogs**: Represent images, specifications, and pricing tables
- **Financial Reports**: Embed charts, graphs, and numerical data
- **Visually Rich Content**: Where layout and visual information are important
- **Multilingual Documents**: Where visual context provides important cues

## Training Details

Nomic Embed Multimodal 3B was developed through several key innovations:

1. **Sampling From the Same Source**: Forcing each batch to be drawn from a single dataset source creates harder in-batch negatives and prevents the model from learning to separate positives from negatives via dataset-specific artifacts.

2. **Hard Negative Mining**: Using an initial model to retrieve top-k nearest neighbors for each query, then incorporating these hard negatives into training.

3. **Positive-aware Hard Negative Mining**: Reducing false negatives using techniques introduced in NV-Retriever (sketched below).
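
A minimal sketch of the positive-aware filter, assuming you already have L2-normalized embeddings from an initial model: retrieve each query's top candidates, then drop any candidate whose similarity exceeds a fixed fraction of the labeled positive's similarity, since such candidates are likely unlabeled positives. The 95% margin mirrors the percentage-to-positive threshold described in NV-Retriever; the exact value and the function signature here are illustrative:

```python
import torch

def mine_hard_negatives(query_emb, positive_emb, corpus_emb, k=10, margin=0.95):
    """Positive-aware hard negative mining (illustrative sketch).

    query_emb:    (num_queries, dim) L2-normalized query embeddings
    positive_emb: (num_queries, dim) embedding of each query's labeled positive
    corpus_emb:   (num_docs, dim)    L2-normalized candidate document embeddings
    Returns one list of hard-negative corpus indices per query.
    """
    sims = query_emb @ corpus_emb.T                    # (num_queries, num_docs)
    pos_sims = (query_emb * positive_emb).sum(dim=-1)  # (num_queries,)
    hard_negatives = []
    for q in range(sims.size(0)):
        # Candidates scoring above margin * positive similarity are likely
        # false negatives (unlabeled positives), so they are discarded; this
        # typically also filters out the positive itself.
        ceiling = margin * pos_sims[q]
        top = sims[q].topk(min(2 * k, sims.size(1)))
        kept = [i.item() for i, s in zip(top.indices, top.values) if s < ceiling]
        hard_negatives.append(kept[:k])
    return hard_negatives
```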


## Limitations

- Performance may vary when processing documents with unconventional layouts or unusual visual elements
- While it handles multiple languages, performance is strongest on English content
- Processing very large or complex documents may require dividing them into smaller chunks
- Performance on documents with handwriting or heavily stylized fonts may be reduced

## Join the Nomic Community

- Nomic Embed Ecosystem: [https://www.nomic.ai/embed](https://www.nomic.ai/embed)
- Website: [https://nomic.ai](https://nomic.ai)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)

## Citation

If you find this model useful in your research or applications, please consider citing:

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
@misc{ma2024unifyingmultimodalretrievaldocument,
  title={Unifying Multimodal Retrieval via Document Screenshot Embedding},
  author={Xueguang Ma and Sheng-Chieh Lin and Minghan Li and Wenhu Chen and Jimmy Lin},
  year={2024},
  eprint={2406.11251},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2406.11251},
}
@misc{nomicembedmultimodal2025,
  title={Nomic Embed Multimodal: Interleaved Text, Image, and Screenshots for Visual Document Retrieval},
  author={Nomic Team},
  year={2025},
  publisher={Nomic AI},
  url={https://nomic.ai/blog/posts/nomic-embed-multimodal},
}
```