Flantier-SmolVLM-2B-dse

A lightweight multimodal vision-language model specialized for technical document retrieval.

Overview

Flantier-SmolVLM-2B-dse (Document Screenshot Embedding) is a 2B-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving text, images, and layout without a separate content-extraction step.

Key Features

  • Efficient Retrieval: Generates document and query embeddings for semantic similarity search
  • Multimodal Understanding: Processes text, diagrams, charts, and tables in their original layout
  • Lightweight Architecture: About 2B parameters, small enough to run on consumer GPUs
  • No Preprocessing Required: Works directly on document screenshots

Installation

pip install torch transformers accelerate pillow

Usage Example

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model and processor
processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-2B-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-2B-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load document image (converted to RGB so grayscale or RGBA files are handled uniformly)
document_image = Image.open("technical_document.jpg").convert("RGB")

# Process for document embedding
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    },
]
doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)

# Generate document embedding
with torch.no_grad():
    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
    # Last-token hidden state; with batched inputs, check that padding does not put pad tokens last
    doc_embedding = doc_outputs.hidden_states[-1][:, -1]
    doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)

# Process for query embedding (text only, no image)
query = "What are the specifications of this component?"
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query}
        ]
    },
]
query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)

# Generate query embedding
with torch.no_grad():
    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
    # Last-token hidden state, pooled the same way as the document embedding
    query_embedding = query_outputs.hidden_states[-1][:, -1]
    query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)

# Calculate similarity
similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
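
The example above scores a single query against a single document. For retrieval over a collection, the same cosine similarity can be used to rank many document embeddings. The following is a minimal sketch of that ranking step only, written with plain Python lists so it runs without the model. The helper names (l2_normalize, rank_documents), the file names, and the toy 3-dimensional vectors are hypothetical stand-ins for real embeddings produced as shown above.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def rank_documents(query_embedding, doc_embeddings, top_k=3):
    """Return up to top_k (doc_id, score) pairs sorted by cosine similarity."""
    q = l2_normalize(query_embedding)
    scored = []
    for doc_id, emb in doc_embeddings.items():
        d = l2_normalize(emb)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real model embeddings (hypothetical data).
toy_index = {
    "datasheet.png": [0.9, 0.1, 0.0],
    "wiring_diagram.png": [0.1, 0.9, 0.1],
    "release_notes.png": [0.0, 0.2, 0.9],
}
ranking = rank_documents([1.0, 0.2, 0.0], toy_index, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.4f}")
```

With real embeddings, a likely equivalent is to stack the normalized document tensors and take a matrix product with the query embedding. For large collections a dedicated vector index would probably replace the linear scan.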

Applications

  • Technical Document Retrieval: Find relevant documents based on technical queries
  • Technical Support Systems: Match user questions to relevant documentation
  • Engineering Knowledge Management: Index and search technical specifications, diagrams, and reports

Training Methodology

This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content extraction preprocessing while preserving all visual and textual information in documents.
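
This card does not state the training objective. Embedding models of this kind are commonly trained with a contrastive InfoNCE loss over in-batch negatives, so the sketch below illustrates that general technique and should not be read as this model's confirmed recipe. The function name info_nce_loss, the temperature value, and the toy 2-dimensional embeddings are assumptions chosen for illustration.

```python
import math


def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """Mean InfoNCE loss where query i's positive is document i.

    All other documents in the batch act as negatives. Embeddings are
    assumed to be L2-normalized already. The temperature is illustrative.
    """
    losses = []
    for i, q in enumerate(query_embs):
        # Similarity of this query to every document in the batch.
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_embs]
        # Numerically stable log-sum-exp over the logits.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching document as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each query matches the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce_loss(queries, aligned_docs)
loss_swapped = info_nce_loss(queries, swapped_docs)
print(loss_aligned < loss_swapped)
```

The loss is low when each query is closest to its own document screenshot and high when it is closer to another document in the batch, which is the property that makes the resulting embeddings usable for similarity search.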

Citation

@misc{flantier-smolvlm-dse,
  author = {racine.ai},
  title = {Flantier-SmolVLM-2B-dse: A Lightweight Document Screenshot Embedding Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/Flantier-SmolVLM-2B-dse}
}

License

This model is released under the Apache 2.0 license.

Model Details

Model size: 2.25B params (Safetensors format, BF16 tensors)