TomoroAI/tomoro-colqwen3-embed-8b
Executive Summary
TomoroAI/tomoro-colqwen3-embed-8b is a state-of-the-art ColPali-style multimodal embedding model. It maps text queries, visual documents (images, PDFs) or short videos into aligned multi-vector embeddings.
Built by merging Qwen/Qwen3-VL-8B-Instruct with Qwen/Qwen3-Embedding-8B, this model inherits robust text retrieval capabilities while preserving a full vision stack. It has been fine-tuned on a curated mixture of VDR, ViDoRe-ColPali-Training, VisRAG-Ret-Train-Synthetic-data, and VisRAG-Ret-Train-In-domain-data. It achieves SOTA or competitive performance across ViDoRe V1-V3 (English and multilingual) while offering a significantly smaller embedding footprint than full-dimension ColPali-style alternatives.
Model Specifications
| Feature | Detail |
|---|---|
| Architecture | Qwen3-VL 8B (Encoder-only variant) + 320-dim Projection Head |
| Methodology | ColPali-style Late Interaction (MaxSim scoring) |
| Token Budget | Up to 1,280 visual tokens per page or 5,120 visual tokens per video (text prompts constrained only by the base context window) |
| Context Window | 32k (inherited from base), typical usage < 2k tokens |
| Output | Multi-vector (Seq_Len × 320), L2-normalized |
| Supported Modalities | Text Queries, RGB Images, Synthetic Documents, Short Video (Frame-wise) |
| Precision | bfloat16 weights, FlashAttention 2 enabled |
Key Properties
- Merged Encoders: Combines the Qwen3-VL vision encoder (patch-grid tokens with spatial merge) and language encoder.
- Projection: A custom 320-dim head projects every token (text or visual) into a vector; queries and documents are compared via MaxSim late interaction (see the sketch after this list).
- Processing:
  - Queries: Left-padded text sequences.
  - Documents: Rendered with a lightweight vision prompt and flattened into image tokens.
  - Video: Supports video retrieval by decoding videos into frames and processing them via the vision stack (a generalization capability, not explicitly fine-tuned; a dedicated benchmark is coming soon).
- Storage Efficiency:
  - Baseline (NVIDIA Nemo-3B): Stores 1,802 tokens @ 3,072 dims (≈10.3 TB for 1M images).
  - Tomoro ColQwen3: Stores at most 1,280 tokens @ 320 dims (≈0.82 TB for 1M images).
  - Result: 13× smaller footprint with higher performance.
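To make the late-interaction scoring above concrete, here is a minimal sketch of ColPali-style MaxSim for a single query/document pair, assuming the L2-normalized multi-vector outputs described in the specification table. The `maxsim_score` name is illustrative only; in practice the released processor's `score_multi_vector` helper (see Usage) performs this scoring in batch.

```python
import torch

def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """MaxSim late interaction for one query/document pair.

    q: (num_query_tokens, 320) L2-normalized query token embeddings.
    d: (num_doc_tokens, 320) L2-normalized document token embeddings.
    """
    sim = q @ d.T                       # cosine similarities (vectors are already L2-normalized)
    return sim.max(dim=1).values.sum()  # best-matching doc token per query token, summed over the query

# Back-of-envelope storage, matching the figures above:
# 1,280 tokens x 320 dims x 2 bytes (bf16) ≈ 0.82 MB per page, i.e. ≈0.82 TB for 1M pages.
```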
Evaluation Results
We report results on the ViDoRe benchmark suite. The model sets new standards on the English and multilingual splits of ViDoRe V2 and V3 while remaining highly competitive on ViDoRe V1.
ViDoRe V3 (Latest)
English nDCG@5
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg |
|---|---|---|---|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.7443 | 0.6491 | 0.6823 | 0.4546 | 0.6421 | 0.5766 | 0.6665 | 0.4747 | 0.6113 |
| tomoro-colqwen3-4b | 0.7419 | 0.6023 | 0.6753 | 0.4202 | 0.6037 | 0.5787 | 0.6612 | 0.4640 | 0.5934 |
| nemo-colembed-3b | 0.7514 | 0.5838 | 0.6712 | 0.3730 | 0.6256 | 0.5447 | 0.6524 | 0.4128 | 0.5769 |
| jinaai/jina-embeddings-v4 | 0.7175 | 0.5842 | 0.6417 | 0.3859 | 0.6206 | 0.5443 | 0.6303 | 0.4191 | 0.5680 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.7528 | 0.5824 | 0.6041 | 0.3877 | 0.6060 | 0.5229 | 0.6226 | 0.4423 | 0.5651 |
Multilingual nDCG@5 (Excluding English Subsets)
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg |
|---|---|---|---|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.7194 | 0.6619 | 0.6172 | 0.4570 | 0.6097 | 0.5164 | 0.6403 | 0.4706 | 0.5866 |
| tomoro-colqwen3-4b | 0.7213 | 0.6374 | 0.6019 | 0.4305 | 0.5637 | 0.5131 | 0.6351 | 0.4636 | 0.5708 |
| nemo-colembed-3b | 0.7216 | 0.5901 | 0.5646 | 0.4102 | 0.5504 | 0.4335 | 0.6170 | 0.4192 | 0.5383 |
| jinaai/jina-embeddings-v4 | 0.6843 | 0.6036 | 0.5482 | 0.4249 | 0.5542 | 0.4732 | 0.6059 | 0.4381 | 0.5416 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.7333 | 0.6160 | 0.5219 | 0.4169 | 0.5494 | 0.4764 | 0.5938 | 0.4449 | 0.5441 |
ViDoRe V2
English nDCG@5
| Model | BioMed | ESG HL | ESG Rpts | Economics | Avg |
|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.6784 | 0.7598 | 0.6549 | 0.6159 | 0.6772 |
| tomoro-colqwen3-4b | 0.6718 | 0.7465 | 0.6300 | 0.5910 | 0.6598 |
| nemo-colembed-3b | 0.6518 | 0.7538 | 0.6030 | 0.6619 | 0.6676 |
| jinaai/jina-embeddings-v4 | 0.6359 | 0.6512 | 0.5194 | 0.5955 | 0.6005 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.6479 | 0.6871 | 0.5498 | 0.5955 | 0.6201 |
Multilingual nDCG@5
| Model | BioMed | ESG Rpts | Economics | Avg |
|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.6467 | 0.5911 | 0.5875 | 0.6085 |
| tomoro-colqwen3-4b | 0.6478 | 0.6226 | 0.5536 | 0.6080 |
| nemo-colembed-3b | 0.6187 | 0.5640 | 0.5506 | 0.5778 |
| jinaai/jina-embeddings-v4 | 0.5994 | 0.5178 | 0.5364 | 0.5512 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.6224 | 0.5336 | 0.5433 | 0.5664 |
ViDoRe V1 (English nDCG@5)
| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.9115 | 0.6637 | 0.9448 | 0.8789 | 0.9926 | 0.9671 | 0.9758 | 0.9906 | 0.9423 | 0.8092 | 0.9076 |
| tomoro-colqwen3-4b | 0.9066 | 0.6624 | 0.9429 | 0.8739 | 0.9926 | 0.9691 | 0.9717 | 0.9963 | 0.9433 | 0.7983 | 0.9057 |
| nemo-colembed-3b | 0.8835 | 0.6621 | 0.9492 | 0.9070 | 0.9963 | 0.9663 | 0.9782 | 0.9926 | 0.9594 | 0.8057 | 0.9100 |
| jinaai/jina-embeddings-v4 | 0.8846 | 0.6014 | 0.9379 | 0.9293 | 0.9926 | 0.9726 | 0.9659 | 0.9913 | 0.9560 | 0.8035 | 0.9035 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.8832 | 0.6011 | 0.9221 | 0.8930 | 0.9876 | 0.9626 | 0.9592 | 0.9926 | 0.9596 | 0.8108 | 0.8972 |
Video Retrieval: CareBench Evaluation
To demonstrate that Tomoro ColQwen3 generalizes strongly to video retrieval, we evaluated the models on the CareBench benchmark for the text-to-video (General Retrieval) task.
For this evaluation, we used a raw video encoding approach: the models encoded the video files directly, without any additional textual annotations or metadata. This highlights the model's ability to perform retrieval based purely on visual semantics.
| Model | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|
| tomoro-colqwen3-8b | 0.8670 | 0.9590 | 0.9850 |
| tomoro-colqwen3-4b | 0.8620 | 0.9570 | 0.9800 |
| Care7B | 0.7700 | 0.9560 | 0.9870 |
We will benchmark more video retrieval datasets in the future.
Usage
The processor exposes `process_texts`, `process_images`, and `score_multi_vector`.
Prerequisites
```bash
pip install torch transformers pillow requests
```
Inference Code
```python
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
    "Retrieve the city of London",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(
        f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path."
    )

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_texts(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            out = model(**batch)
        vecs = out.embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            out = model(**features)
        vecs = out.embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
```
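Assuming `score_multi_vector` returns a `(num_queries, num_documents)` tensor of MaxSim scores, as the printout above suggests, a simple way to read the result is to pick the best-scoring document per query. This is an illustrative snippet reusing the variables defined above:

```python
# Pick the top-scoring document for each query (scores: num_queries x num_docs).
best = scores.argmax(dim=1)
for query, doc_idx in zip(queries, best.tolist()):
    print(f"{query} -> {docs[doc_idx]}")
```

For PDF documents, one common approach (not shown above and not an official recipe) is to render each page to a PIL image first, e.g. with pdf2image or pypdfium2, and pass the page images to `process_images` exactly like the photos here.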
Lightweight Video Retrieval
ColQwen3 generalizes to short videos despite being trained on image-text retrieval. This minimal example passes video file paths to the processor, which decodes and samples frames, encodes queries and clips into multi-vector embeddings, and scores them with MaxSim.
We recommend using at most 5,120 visual tokens per video for best retrieval performance.
```python
from pathlib import Path

import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=5120,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

queries = ["Retrieve the football video", "Find the basketball clip", "Find the swimming clip", "Find the wrestling clip"]
videos = ["/root/sample_videos/football.mp4", "/root/sample_videos/basketball.mp4", "/root/sample_videos/swimming.mp4", "/root/sample_videos/wrestling.mp4"]

def encode_queries(texts):
    batch = processor.process_texts(texts=texts)
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.inference_mode():
        out = model(**batch)
    return out.embeddings.to(torch.bfloat16).cpu()

def encode_videos(paths):
    vids = [str(Path(p).expanduser()) for p in paths]
    feats = processor(
        videos=vids,
        padding="longest",
        return_tensors=None,  # keep metadata as Python objects until we drop it
        videos_kwargs={"return_metadata": True},
    )
    feats.pop("video_metadata", None)  # drop metadata before forwarding to the model
    feats = feats.convert_to_tensors(tensor_type="pt")
    feats = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in feats.items()}
    with torch.inference_mode():
        out = model(**feats)
    return out.embeddings.to(torch.bfloat16).cpu()

q_emb = encode_queries(queries)
v_emb = encode_videos(videos)
scores = processor.score_multi_vector(q_emb, v_emb)
print(scores)
```
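At 5,120 visual tokens per clip, batched video encoding can be memory-hungry. One simple workaround is to encode clips one at a time and collect per-clip embeddings before scoring; this sketch assumes `score_multi_vector` also accepts lists of per-item embedding tensors, as in the image example above.

```python
# Encode clips individually to bound peak GPU memory, then score as before.
def encode_videos_one_at_a_time(paths):
    embs = []
    for p in paths:
        embs.extend(encode_videos([p]))  # one (num_tokens, 320) tensor per clip
    return embs

v_emb_list = encode_videos_one_at_a_time(videos)
scores = processor.score_multi_vector(list(q_emb), v_emb_list)
print(scores)
```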
Strengths & Limitations
Strengths
- Performance: State-of-the-art retrieval performance on the ViDoRe V2 and V3 benchmarks, with excellent results on multimodal document retrieval.
- Complex Layouts: Excellent handling of chart-rich PDFs and domain-specific documents.
- End-to-end Retrieval: OCR-free retrieval on unseen multimodal documents, without an intermediate vision LLM generating summaries for retrieval.
- Retrieval Task Transfer: Inherits strong text retrieval performance from the merged Qwen3-Embedding-8B model.
- Multilingualism: Strong performance on non-English document inputs.
Limitations
- Video Support: In our preliminary findings the model generalizes to video retrieval, but it has not been fine-tuned on large-scale video retrieval datasets; we plan to improve this in the future.
- Storage Cost: Still larger than single-vector baselines despite the smaller per-token dimension.
- Retrieval Instructions: The model is not currently fine-tuned with diverse retrieval instructions in the style of the Qwen3-Embedding models; we intend to improve this with more synthetic data in the future.
License & Data
Distributed under Apache 2.0.
- Weights: Upstream Qwen checkpoints retain their community licenses; ensure compliance when mixing.
- Data: Training data includes ViDoRe/MTEB corpora and synthetic VisRAG assets.
Acknowledgement
We gratefully acknowledge the support of Tomoro AI, a leading AI engineering firm dedicated to delivering high-quality enterprise solutions that accelerate complex R&D and business transformation. This work is directly applied to enhance Tomoro's customized multimodal agentic RAG pipelines, empowering autonomous agents to parse, reason over, and retrieve from large-scale enterprise internal documentation. By bridging the gap between vision and language, this model supports Tomoro AI's mission to accelerate the delivery of high-quality enterprise multimodal solutions and deploy robust, production-grade intelligence across high-stakes industries.
Citation
If you use this model, please cite:
```bibtex
@misc{huang2025tomoro_colqwen3_embed,
  title  = {TomoroAI/tomoro-colqwen3-embed},
  author = {Xin Huang and Kye Min Tan and Albert Phelps},
  year   = {2025},
  url    = {https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b}
}
```