Flantier-SmolVLM-2B-dse
A lightweight multimodal vision-language model specialized for technical document retrieval.
Overview
Flantier-SmolVLM-2B-dse (Document Screenshot Embedding) is a 2B-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving text, images, and layout without a separate content-extraction step.
Key Features
- Efficient Retrieval: Generates document and query embeddings for semantic similarity search
- Multimodal Understanding: Processes text, diagrams, charts, and tables in their original layout
- Lightweight Architecture: Only 2B parameters, runs on consumer GPUs
- No Preprocessing Required: Works directly on document screenshots
Installation
pip install torch transformers accelerate pillow
Usage Example
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
# Load model and processor
processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-2B-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-2B-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load the document screenshot (converted to RGB in case the file is grayscale or CMYK)
document_image = Image.open("technical_document.jpg").convert("RGB")
# Process for document embedding
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)
# Generate document embedding
with torch.no_grad():
    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
    # Final-layer hidden state of the last token serves as the embedding
    doc_embedding = doc_outputs.hidden_states[-1][:, -1]
doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)
# Process query embedding
query = "What are the specifications of this component?"
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query},
        ],
    },
]
query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)
# Generate query embedding
with torch.no_grad():
    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
    # Final-layer hidden state of the last token serves as the embedding
    query_embedding = query_outputs.hidden_states[-1][:, -1]
query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)
# Calculate similarity
similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
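The example above scores one query against one document. For retrieval over several documents, the same similarity can be computed against each stored embedding and the results sorted. The following is a minimal, dependency-free sketch of that ranking step. It uses toy 3-dimensional vectors in place of real model embeddings, and the names `l2_normalize`, `rank_documents`, and the file names are hypothetical, not part of the model's API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (same idea as torch.nn.functional.normalize)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (doc_id, score) pairs, highest cosine similarity first."""
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        # For unit-length vectors the dot product equals the cosine similarity.
        score = sum(a * b for a, b in zip(q, d))
        scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real document and query embeddings (assumption).
toy_docs = {
    "datasheet.png": [0.9, 0.1, 0.0],
    "wiring_diagram.png": [0.1, 0.9, 0.1],
    "changelog.png": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

ranking = rank_documents(toy_query, toy_docs, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.4f}")
```

With real embeddings, the vectors would come from the model as shown in the usage example, and because they are already L2-normalized there, a single matrix product between the query and the stacked document embeddings would give the same scores.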
Applications
- Technical Document Retrieval: Find relevant documents based on technical queries
- Technical Support Systems: Match user questions to relevant documentation
- Engineering Knowledge Management: Index and search technical specifications, diagrams, and reports
Training Methodology
This model was trained using the Document Screenshot Embedding (DSE) approach, which treats a document screenshot as a single, unified input. This removes the need for content-extraction preprocessing (such as OCR or layout parsing) while keeping the visual and textual information in the document intact.
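This card does not state the training objective. Embedding models of this kind are commonly trained with a contrastive (InfoNCE) loss over query and document pairs, using the other documents in the batch as negatives. The sketch below illustrates that loss under this assumption. The function name `info_nce_loss`, the temperature value, and the toy vectors are illustrative and are not taken from the model's training code.

```python
import math


def info_nce_loss(query_vecs, doc_vecs, temperature=0.02):
    """Mean InfoNCE loss where doc_vecs[i] is the positive for query_vecs[i].

    Every other document in the batch acts as a negative. Inputs are assumed
    to be L2-normalized, so dot products are cosine similarities.
    The temperature default is an illustrative value (assumption).
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        # Negative log-probability assigned to the matching document.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy unit vectors: each query matches the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce_loss(queries, aligned_docs)
swapped_loss = info_nce_loss(queries, swapped_docs)
print(f"aligned: {aligned_loss:.4f}, swapped: {swapped_loss:.4f}")
```

The loss is low when each query is most similar to its own document and high when the pairing is wrong, which is the property that pushes matching query and screenshot embeddings together.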
Citation
@misc{flantier-smolvlm-dse,
author = {racine.ai},
title = {Flantier-SmolVLM-2B-dse: A Lightweight Document Screenshot Embedding Model},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/racineai/Flantier-SmolVLM-2B-dse}
}
License
This model is released under the Apache 2.0 license.