llama-nemoretriever-colembed-1b-v1
Description
The nvidia/llama-nemoretriever-colembed-1b-v1 is a late interaction embedding model fine-tuned for query-document retrieval. Users can input queries, which are text, or documents, which are page images, to the model. The model outputs ColBERT-style multi-vector numerical representations for input queries and documents. It is the smaller version of llama-nemoretriever-colembed-3b-v1, which achieved 1st place on ViDoRe V1 (nDCG@5), ViDoRe V2 (nDCG@5) and MTEB VisualDocumentRetrieval (Rank Borda) as of June 27, 2025. nvidia/llama-nemoretriever-colembed-1b-v1 achieves 2nd place on those benchmarks.
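As a late interaction model, it scores a query-document pair with a ColBERT-style MaxSim operator over the two sets of token vectors, rather than a single dot product between pooled embeddings. Below is a minimal sketch of that scoring scheme; the function name and the use of plain PyTorch tensors are illustrative assumptions, not the model's actual API (the model ships its own get_scores helper, shown under Usage).
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch of ColBERT-style late interaction scoring.
    # query_emb: [num_query_tokens, dim]; doc_emb: [num_doc_tokens, dim]
    sim = query_emb @ doc_emb.T              # token-level similarity matrix
    # For each query token, keep its best-matching document token,
    # then sum over query tokens to obtain the relevance score.
    return sim.max(dim=1).values.sum()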
This model is for non-commercial/research use only.
License/Terms of Use
Governing Terms for llama-nemoretriever-colembed-1b-v1 model: NVIDIA Non-Commercial License
Additional Information: Apache License 2.0 for siglip2-giant-opt-patch16-384; and LLAMA 3.2 Community License Agreement for Llama-3.2-1B. Built with Meta Llama 3. Improved using Qwen.
This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
Team
- Mengyao Xu
- Gabriel Moreira
- Radek Osmulski
- Ronay Ak
- Yauhen Babakhin
- Even Oldridge
- Benedikt Schifferer
Correspondence to Mengyao Xu ([email protected]) and Benedikt Schifferer ([email protected])
Citation
Will be published soon
NVIDIA’s Retrieval Models
Model Name | Use-Case | Comment |
---|---|---|
nvidia/llama-NemoRetriever-ColEmbed-1B-v1 | Research-Only | Smaller Version of nvidia/llama-NemoRetriever-ColEmbed-3B-v1 |
nvidia/llama-NemoRetriever-ColEmbed-3B-v1 | Research-Only | #1 ViDoRe V1, V2 and MTEB VisualDocumentRetrieval as of June 27, 2025 |
llama-3_2-nemoretriever-1b-vlm-embed-v1 | Commercial Application | MultiModal Embedding Model for Production Use-Case of Visual Document Retrieval |
llama-3_2-nv-embedqa-1b-v2 | Commercial Application | Text Embedding Model for Production Use-Case of Text Document Retrieval |
llama-3_2-nemoretriever-500m-rerank-v2 | Commercial Application | Text Reranker Model for Production Use-Case of Text Document Retrieval |
llama-3_2-nv-rerankqa-1b-v2 | Commercial Application | Text Reranker Model for Production Use-Case of Text Document Retrieval |
nvidia/NV-Embed-v2 | Research-Only | #1 MTEB as of Aug 30, 2024 |
nvidia/MM-Embed | Research-Only | Improved nvidia/NV-Embed-v1 and multimodal embeddings |
nvidia/NV-Retriever-v1 | Research-Only | #1 MTEB BEIR as of July 12th, 2024 |
Deployment Geography
Global
Use Case
llama-nemoretriever-colembed is intended for researchers exploring applications that must understand or retrieve information across both text and image modalities. It is instrumental in multimodal RAG systems, where queries are text and documents are page images containing text, charts, tables, or infographics. Potential applications include multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding.
Release Date
Huggingface on 06/27/2025 via https://huggingface.co/nvidia/llama-nemoretriever-colembed-1b-v1
Model Architecture
- Architecture Type: Transformer
- Network Architecture: google/siglip2-giant-opt-patch16-384 + meta-llama/Llama-3.2-1B
The llama-nemoretriever-colembed-1b-v1 is a transformer-based multimodal embedding model built on a vision-language model that combines google/siglip2-giant-opt-patch16-384 and meta-llama/Llama-3.2-1B.
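The sketch below illustrates the general vision-encoder-plus-LLM composition described above. It is a conceptual illustration under assumed module names and interfaces, not the model's actual implementation.
import torch
import torch.nn as nn

class LateInteractionVLMEmbedder(nn.Module):
    # Conceptual sketch only: module names and interfaces are hypothetical.
    def __init__(self, vision_tower: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_tower = vision_tower  # SigLIP2-style image encoder
        self.projector = projector        # maps patch features to the LLM hidden size
        self.llm = llm                    # Llama-3.2-style decoder returning hidden states

    def embed_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        patches = self.vision_tower(pixel_values)  # [num_patches, vision_dim]
        tokens = self.projector(patches)           # [num_patches, llm_dim]
        hidden = self.llm(tokens)                  # contextualized token states
        # One normalized embedding per token -> multi-vector representation
        return nn.functional.normalize(hidden, dim=-1)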
Input
Property | Query | Document |
---|---|---|
Input Type | Text | Text or Image |
Input Format | List of strings | List of strings or list of images |
Input Parameter | 1D | 1D |
Other Properties | The model's maximum context length is 8192 tokens. Texts longer than the maximum length must be chunked or truncated. | The model's maximum context length is 8192 tokens. Texts longer than the maximum length must be chunked or truncated (see the chunking sketch below the table). Images must be in Python PIL format. The model scales each image into multiple tiles of 512x512. |
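As the table notes, texts beyond the 8192-token context must be chunked or truncated before encoding. A minimal chunking sketch follows; the tokenizer argument is an assumption (any Hugging Face-style tokenizer), since the model card does not prescribe one for pre-chunking.
def chunk_text(text: str, tokenizer, max_tokens: int = 8192, overlap: int = 256) -> list:
    # Split a long text into overlapping windows of at most max_tokens tokens.
    # `tokenizer` is assumed to expose the standard HF encode/decode interface.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        chunks.append(tokenizer.decode(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):
            break
    return chunks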
Output
- Output Type: Floats
- Output Format: List of float arrays
- Output Parameters: A list of float arrays with shape [batch_size x sequence_length x embedding_dim]
- Other Properties Related to Output: The model outputs one embedding vector per input token (ColBERT-style multi-vector output).
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Usage
The model requires transformers version 4.49.0 and flash attention:
pip install transformers==4.49.0
pip install flash-attn==2.6.3 --no-build-isolation
import requests
from PIL import Image
from io import BytesIO
import torch
from transformers import AutoModel
# Load Model
model = AutoModel.from_pretrained(
'nvidia/llama-nemoretriever-colembed-1b-v1',
device_map='cuda',
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
revision='1f0fdea7f5b19532a750be109b19072d719b8177'
).eval()
# Queries
queries = [
'How much percentage of Germanys population died in the 2nd World War?',
'How many million tons CO2 were captured from Gas processing in 2018?',
'What is the average CO2 emission of someone in Japan?'
]
# Documents
image_urls = [
'https://upload.wikimedia.org/wikipedia/commons/3/35/Human_losses_of_world_war_two_by_country.png',
'https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/20210413_Carbon_capture_and_storage_-_CCS_-_proposed_vs_implemented.svg/2560px-20210413_Carbon_capture_and_storage_-_CCS_-_proposed_vs_implemented.svg.png',
'https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/20210626_Variwide_chart_of_greenhouse_gas_emissions_per_capita_by_country.svg/2880px-20210626_Variwide_chart_of_greenhouse_gas_emissions_per_capita_by_country.svg.png'
]
# Load into PIL
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
images = [Image.open(BytesIO(requests.get(image_url, headers=headers).content)) for image_url in image_urls]
# Encoding
query_embeddings = model.forward_queries(queries, batch_size=8)
passage_embeddings = model.forward_passages(images, batch_size=8)
scores = model.get_scores(
query_embeddings,
passage_embeddings
)
# Diagonal should have high scores
print(scores)
# tensor([[13.9970, 11.4219, 12.1225],
# [11.4157, 14.6388, 12.0341],
# [ 9.9023, 9.8857, 11.3387]], device='cuda:0')
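Since get_scores returns a [num_queries x num_passages] matrix, mapping each query to its best-scoring document is a simple argmax over the passage axis. Continuing the snippet above:
# Continues the example above: pick the highest-scoring document per query.
best = scores.argmax(dim=1)
for query, idx in zip(queries, best.tolist()):
    print(query, '->', image_urls[idx])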
The HuggingFace model artifact contains a script to evaluate ViDoRe V1 and ViDoRe V2, based on the vidore-benchmark GitHub repository:
pip install git+https://github.com/illuin-tech/vidore-benchmark@e0eb9032e7e00adc8aa6f9cb35d5a9371f67485a
# Downgrade transformers, as vidore will install the latest transformers
pip install transformers==4.49.0
CUDA_VISIBLE_DEVICES=0 python3 vidore_eval.py --model_name_or_path nvidia/llama-nemoretriever-colembed-1b-v1 --savedir_datasets ./results/ --model_revision 1f0fdea7f5b19532a750be109b19072d719b8177
The HuggingFace model artifact contains a script to evaluate MTEB VisualDocumentRetrieval. First, we install the ViDoRe benchmark to capture its dependencies.
pip install git+https://github.com/illuin-tech/vidore-benchmark@e0eb9032e7e00adc8aa6f9cb35d5a9371f67485a
pip install transformers==4.49.0
# Install the MTEB PR, which contains the model metadata
pip install git+https://github.com/embeddings-benchmark/mteb
CUDA_VISIBLE_DEVICES=0 python3 mteb_eval.py --model_name_or_path nvidia/llama-nemoretriever-colembed-1b-v1
Software Integration:
- Runtime Engine(s): TensorRT, Triton
- Supported Hardware Microarchitecture Compatibility: A100 40GB, A100 80GB, H100 80GB
- Supported Operating System(s): Linux
Model Version(s)
llama-nemoretriever-colembed-1b-v1
Training and Evaluation Datasets
- The total size (in number of data points): 12.74M QA pairs for training
- Total number of datasets: 23 datasets used for training and 17 datasets used for evaluation
Training Dataset
The model was trained on publicly available datasets, including HotpotQA, MIRACL, Natural Questions (NQ), Stack Exchange, SQuAD, Tiger Math/Stack, DocMatix-IR, VDR, Vidore-ColPali-Training, VisRAG-Ret-Train-Synthetic-data, VisRAG-Ret-Train-In-domain-data, and Wiki-SS-NQ.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
- Properties: Training: 1st stage: 12M QA pairs; 2nd stage: 500k QA pairs; 3rd stage: 240k QA pairs
Evaluation Dataset
We evaluate the model on multiple visual document retrieval benchmarks: ViDoRe V1, ViDoRe V2, and MTEB Visual Document Retrieval.
- Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
- Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
- Properties: More details on ViDoRe V1 and ViDoRe V2 can be found on their leaderboard. The Visual Document Retrieval Benchmark, ViDoRe, is composed of various page-level retrieval tasks spanning multiple domains, languages, and settings.
Benchmark | Model 1B | Model 3B |
---|---|---|
ViDoRe V1 (06/27/2025) | 0.9050 | 0.9100 |
ViDoRe V1 (deprecated) | 0.9049 | 0.9098 |
ViDoRe V2 (06/27/2025) | 0.6209 | 0.6352 |
ViDoRe V2 (deprecated) | 0.6261 | 0.6342 |
MTEB Visual Document Retrieval | 0.8238 | 0.8315 |
Note: All scores are Avg. nDCG@5. ViDoRe V1 and V2 were updated on June 27th, 2025 to use the scores calculated by MTEB, which can result in slightly different scores. ViDoRe V2 (06/27/2025) uses only 4 of the original 7 datasets.
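For reference, since all reported numbers are average nDCG@5, here is a minimal sketch of nDCG@k for a single query. It assumes binary relevance and computes the ideal DCG from the retrieved list only, a common simplification; the benchmarks' official implementations may differ in detail.
import math

def ndcg_at_k(ranked_relevance: list, k: int = 5) -> float:
    # ranked_relevance: relevance labels of retrieved docs in ranked order, e.g. [1, 0, 1, 0, 0]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))
    ideal = sorted(ranked_relevance, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0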
Inference:
- Acceleration Engine: Not Applicable
- Test Hardware: A100 40GB, A100 80GB, H100 80GB
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.