llama-nemoretriever-colembed-1b-v1
Description
The nvidia/llama-nemoretriever-colembed-1b-v1 is a late interaction embedding model fine-tuned for query-document retrieval. Users can input queries, which are text, or documents, which are page images, to the model. The model outputs ColBERT-style multi-vector numerical representations for input queries and documents. It is the smaller version of llama-nemoretriever-colembed-3b-v1, which achieved 1st place on ViDoRe V1 (nDCG@5), ViDoRe V2 (nDCG@5) and MTEB VisualDocumentRetrieval (Rank Borda) as of June 27, 2025. nvidia/llama-nemoretriever-colembed-1b-v1 achieves 2nd place on those benchmarks.
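As a late interaction model, it scores a query-document pair with a ColBERT-style MaxSim operator over the two sets of token vectors, rather than a single dot product between pooled embeddings. Below is a minimal sketch of that scoring scheme; the function name and the use of plain PyTorch tensors are illustrative assumptions, not the model's actual API (the model ships its own get_scores helper, shown under Usage).
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch of ColBERT-style late interaction scoring.
    # query_emb: [num_query_tokens, dim]; doc_emb: [num_doc_tokens, dim]
    sim = query_emb @ doc_emb.T              # token-level similarity matrix
    # For each query token, keep its best-matching document token,
    # then sum over query tokens to obtain the relevance score.
    return sim.max(dim=1).values.sum()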
This model is for non-commercial/research use only.
License/Terms of Use
Governing Terms for llama-nemoretriever-colembed-1b-v1 model: NVIDIA Non-Commercial License
Additional Information: Apache License 2.0 for siglip2-giant-opt-patch16-384; and LLAMA 3.2 Community License Agreement for Llama-3.2-1B. Built with Meta Llama 3. Improved using Qwen.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
Team
- Mengyao Xu
- Gabriel Moreira
- Radek Osmulski
- Ronay Ak
- Yauhen Babakhin
- Even Oldridge
- Benedikt Schifferer
Correspondence to Mengyao Xu ([email protected]) and Benedikt Schifferer ([email protected])
Citation
Will be published soon
NVIDIA’s Retrieval Models
Model Name | Use-Case | Comment |
---|---|---|
nvidia/llama-NemoRetriever-ColEmbed-1B-v1 | Research-Only | Smaller Version of nvidia/llama-NemoRetriever-ColEmbed-3B-v1 |
nvidia/llama-NemoRetriever-ColEmbed-3B-v1 | Research-Only | #1 ViDoRe V1, V2 and MTEB VisualDocumentRetrieval as of June 27, 2025 |
llama-3_2-nemoretriever-1b-vlm-embed-v1 | Commercial Application | MultiModal Embedding Model for Production Use-Case of Visual Document Retrieval |
llama-3_2-nv-embedqa-1b-v2 | Commercial Application | Text Embedding Model for Production Use-Case of Text Document Retrieval |
llama-3_2-nemoretriever-500m-rerank-v2 | Commercial Application | Text Reranker Model for Production Use-Case of Text Document Retrieval |
llama-3_2-nv-rerankqa-1b-v2 | Commercial Application | Text Reranker Model for Production Use-Case of Text Document Retrieval |
nvidia/NV-Embed-v2 | Research-Only | #1 MTEB as of Aug 30, 2024 |
nvidia/MM-Embed | Research-Only | Improved nvidia/NV-Embed-v1 and multimodal embeddings |
nvidia/NV-Retriever-v1 | Research-Only | #1 MTEB BEIR as of July 12th, 2024 |
Deployment Geography
Global
Use Case
llama-nemoretriever-colembed is intended for researchers exploring applications that must understand or retrieve information across both text and image modalities. It is instrumental in multimodal RAG systems, where queries are text and documents are page images containing text, charts, tables, or infographics. Potential applications include multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding.
Release Date
Huggingface on 06/27/2025 via https://huggingface.co/nvidia/llama-nemoretriever-colembed-1b-v1
Model Architecture
- Architecture Type: Transformer
- Network Architecture: google/siglip2-giant-opt-patch16-384 + meta-llama/Llama-3.2-1B
The llama-nemoretriever-colembed-1b-v1 is a transformer-based multimodal embedding model built on a vision-language model that combines google/siglip2-giant-opt-patch16-384 and meta-llama/Llama-3.2-1B.
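The sketch below illustrates the general vision-encoder-plus-LLM composition described above. It is a conceptual illustration under assumed module names and interfaces, not the model's actual implementation.
import torch
import torch.nn as nn

class LateInteractionVLMEmbedder(nn.Module):
    # Conceptual sketch only: module names and interfaces are hypothetical.
    def __init__(self, vision_tower: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower  # SigLIP2-style image encoder
        self.projector = projector        # maps patch features to the LLM hidden size
        self.llm = llm                    # Llama-3.2-style decoder returning hidden states

    def embed_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        patches = self.vision_tower(pixel_values)  # [num_patches, vision_dim]
        tokens = self.projector(patches)           # [num_patches, llm_dim]
        hidden = self.llm(tokens)                  # contextualized token states
        # One normalized embedding per token -> multi-vector representation
        return nn.functional.normalize(hidden, dim=-1)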
Input
Property | Query | Document |
---|---|---|
Input Type | Text | Text or Image |
Input Format | List of strings | List of strings or list of images |
Input Parameter | 1D | 1D |
Other Properties | The model's maximum context length is 8192 tokens. Texts longer than the maximum length must be chunked or truncated. | The model's maximum context length is 8192 tokens. Texts longer than the maximum length must be chunked or truncated (see the chunking sketch below the table). Images must be in Python PIL format. The model scales each image into multiple tiles of 512x512. |
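As the table notes, texts beyond the 8192-token context must be chunked or truncated before encoding. A minimal chunking sketch follows; the tokenizer argument is an assumption (any Hugging Face-style tokenizer), since the model card does not prescribe one for pre-chunking.
def chunk_text(text: str, tokenizer, max_tokens: int = 8192, overlap: int = 256) -> list:
    # Split a long text into overlapping windows of at most max_tokens tokens.
    # `tokenizer` is assumed to expose the standard HF encode/decode interface.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):
            break
    return chunks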
Output
- Output Type: Floats
- Output Format: List of float arrays
- Output Parameters: A list of float arrays with shape [batch_size x sequence_length x embedding_dim]
- Other Properties Related to Output: The model outputs one embedding vector per input token (ColBERT-style multi-vector output).
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Usage
The model requires transformers version 4.49.0 and flash attention:
pip install transformers==4.49.0
pip install flash-attn==2.6.3 --no-build-isolation
import requests
from PIL import Image
from io import BytesIO
import torch
from transformers import AutoModel
# Load Model
model = AutoModel.from_pretrained(
'nvidia/llama-nemoretriever-colembed-1b-v1',
device_map='cuda',
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
revision='1f0fdea7f5b19532a750be109b19072d719b8177'
).eval()
# Queries
queries = [
'How much percentage of Germanys population died in the 2nd World War?',
'How many million tons CO2 were captured from Gas processing in 2018?',
'What is the average CO2 emission of someone in Japan?'
]
# Documents
image_urls = [
'https://upload.wikimedia.org/wikipedia/commons/3/35/Human_losses_of_world_war_two_by_country.png',
'https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/20210413_Carbon_capture_and_storage_-_CCS_-_proposed_vs_implemented.svg/2560px-20210413_Carbon_capture_and_storage_-_CCS_-_proposed_vs_implemented.svg.png',
'https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/20210626_Variwide_chart_of_greenhouse_gas_emissions_per_capita_by_country.svg/2880px-20210626_Variwide_chart_of_greenhouse_gas_emissions_per_capita_by_country.svg.png'
]
# Load into PIL
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
images = [Image.open(BytesIO(requests.get(image_url, headers=headers).content)) for image_url in image_urls]
# Encoding
query_embeddings = model.forward_queries(queries, batch_size=8)
passage_embeddings = model.forward_passages(images, batch_size=8)
scores = model.get_scores(
query_embeddings,
passage_embeddings
)
# Diagonal should have high scores
print(scores)
# tensor([[13.9970, 11.4219, 12.1225],
# [11.4157, 14.6388, 12.0341],
# [ 9.9023, 9.8857, 11.3387]], device='cuda:0')
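Since get_scores returns a [num_queries x num_passages] matrix, mapping each query to its best-scoring document is a simple argmax over the passage axis. Continuing the snippet above:
# Continues the example above: pick the highest-scoring document per query.
best = scores.argmax(dim=1)
for query, idx in zip(queries, best.tolist()):
    print(query, '->', image_urls[idx])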
The HuggingFace model artifact contains a script to evaluate ViDoRe V1 and ViDoRe V2, based on the vidore-benchmark GitHub repository:
pip install git+https://github.com/illuin-tech/vidore-benchmark@e0eb9032e7e00adc8aa6f9cb35d5a9371f67485a
# Downgrade transformers, as vidore will install the latest transformers
pip install transformers==4.49.0
CUDA_VISIBLE_DEVICES=0 python3 vidore_eval.py --model_name_or_path nvidia/llama-nemoretriever-colembed-1b-v1 --savedir_datasets ./results/ --model_revision 1f0fdea7f5b19532a750be109b19072d719b8177
The HuggingFace model artifact contains a script to evaluate MTEB VisualDocumentRetrieval. First, we install the ViDoRe benchmark to capture its dependencies.
pip install git+https://github.com/illuin-tech/vidore-benchmark@e0eb9032e7e00adc8aa6f9cb35d5a9371f67485a
pip install transformers==4.49.0
# Install the MTEB PR, which contains the model metadata
pip install git+https://github.com/embeddings-benchmark/mteb
CUDA_VISIBLE_DEVICES=0 python3 mteb_eval.py --model_name_or_path nvidia/llama-nemoretriever-colembed-1b-v1
Software Integration:
- Runtime Engine(s): TensorRT, Triton
- Supported Hardware Microarchitecture Compatibility: A100 40GB, A100 80GB, H100 80GB
- Supported Operating System(s): Linux
Model Version(s)
llama-nemoretriever-colembed-1b-v1
Training and Evaluation Datasets
- The total size (in number of data points): 12.74M QA pairs for training
- Total number of datasets: 23 datasets used for training and 17 datasets used for evaluation
Training Dataset
The model was trained on publicly available datasets, including HotpotQA, MIRACL, Natural Questions (NQ), Stack Exchange, SQuAD, Tiger Math/Stack, DocMatix-IR, VDR, Vidore-ColPali-Training, VisRAG-Ret-Train-Synthetic-data, VisRAG-Ret-Train-In-domain-data, and Wiki-SS-NQ.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
- Properties: Training: 1st stage: 12M QA pairs; 2nd stage: 500k QA pairs; 3rd stage: 240k QA pairs
Evaluation Dataset
We evaluate the model on multiple visual document retrieval benchmarks: ViDoRe V1, ViDoRe V2, and MTEB Visual Document Retrieval.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
- Properties: More details on ViDoRe V1 and ViDoRe V2 can be found on their leaderboard. The Visual Document Retrieval Benchmark, ViDoRe, is composed of various page-level retrieval tasks spanning multiple domains, languages, and settings.
Benchmark | Model 1B | Model 3B |
---|---|---|
ViDoRe V1 (06/27/2025) | 0.9050 | 0.9100 |
ViDoRe V1 (deprecated) | 0.9049 | 0.9098 |
ViDoRe V2 (06/27/2025) | 0.6209 | 0.6352 |
ViDoRe V2 (deprecated) | 0.6261 | 0.6342 |
MTEB Visual Document Retrieval | 0.8238 | 0.8315 |
Note: All scores are Avg. nDCG@5. ViDoRe V1 and V2 were updated on June 27th, 2025 to use the scores calculated by MTEB, which can result in slightly different scores. ViDoRe V2 (06/27/2025) uses only 4 of the original 7 datasets.
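For reference, since all reported numbers are average nDCG@5, here is a minimal sketch of nDCG@k for a single query. It assumes binary relevance and computes the ideal DCG from the retrieved list only, a common simplification; the benchmarks' official implementations may differ in detail.
import math

def ndcg_at_k(ranked_relevance: list, k: int = 5) -> float:
    # ranked_relevance: relevance labels of retrieved docs in ranked order, e.g. [1, 0, 1, 0, 0]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0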
Inference:
- Acceleration Engine: Not Applicable
- Test Hardware: A100 40GB, A100 80GB, H100 80GB
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.