Ita-Search 🇮🇹

Fine-tuned Qwen3-Embedding for Italian Semantic Retrieval

This model is a fine-tuned version of Qwen/Qwen3-Embedding-0.6B specialized for Italian semantic retrieval, with particular emphasis on Italian query understanding and document ranking.

Model Description

  • Model Type: Dense embedding model for semantic retrieval
  • Base Model: Qwen/Qwen3-Embedding-0.6B
  • Output Dimensionality: 1,024-dimensional dense vectors
  • Maximum Sequence Length: 32,768 tokens
  • Primary Language: Italian
  • Similarity Function: Cosine similarity (see the verification sketch below)
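
These properties can be checked directly through the standard sentence-transformers API. The snippet below is a minimal verification sketch; the exact values reported (in particular max_seq_length and the similarity function name) depend on the loaded configuration and your sentence-transformers version.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DeepMount00/Ita-Search")

# Embedding dimensionality and maximum sequence length reported by the loaded model
print(model.get_sentence_embedding_dimension())  # expected: 1024
print(model.max_seq_length)                      # expected: 32768

# Scoring function configured for the model (recent sentence-transformers versions)
print(model.similarity_fn_name)                  # expected: "cosine"

# A single sentence encodes to one dense vector
embedding = model.encode("Un esempio di frase in italiano.")
print(embedding.shape)                           # expected: (1024,)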

Capabilities

Italian Semantic Retrieval

The model demonstrates strong performance in matching Italian queries to Italian documents, and is particularly effective in technical and academic domains.

Domain Coverage

Trained on diverse Italian knowledge domains including:

  • Medical & Health Sciences: Diagnostic imaging, clinical procedures, medical terminology
  • STEM Fields: Physics, computer science, geology, engineering
  • Professional Domains: Finance, law, agriculture, software development
  • Educational Content: Historical studies, culinary arts, general knowledge

Query Understanding

Enhanced comprehension of:

  • Conversational and informal Italian query patterns
  • Technical terminology in Italian across domains
  • Italian semantic concepts and nuances
  • Complex multi-faceted questions in Italian

Training Data

The model was fine-tuned on a curated corpus of Italian semantic data, featuring high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:

  • Hard negative mining: Strategic inclusion of semantically related but incorrect documents (see the triplet sketch after this list)
  • Italian language focus: Broad representation of Italian language patterns
  • Domain diversity: Coverage of academic, professional, and conversational contexts in Italian
  • Quality curation: Manual review and automated filtering for coherence and relevance
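
The exact training recipe is not reproduced here. The sketch below only illustrates the general (query, positive passage, hard negative) triplet format and a contrastive objective commonly paired with such triplets in sentence-transformers (MultipleNegativesRankingLoss); all strings and hyperparameters are illustrative assumptions, not the actual training configuration.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical triplet: query, relevant passage, hard negative (related topic, wrong answer)
train_examples = [
    InputExample(texts=[
        "Come si riconosce una faglia trascorrente?",                   # query
        "Le faglie trascorrenti mostrano un movimento orizzontale...",  # positive passage
        "Le faglie normali derivano da sforzi estensionali...",         # hard negative
    ]),
]

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
train_loader = DataLoader(train_examples, shuffle=True, batch_size=1)
# Uses in-batch negatives in addition to the explicit hard negative in each triplet
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=0)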

Usage

Basic Retrieval

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DeepMount00/Ita-Search")

# Italian query-document matching
query = "Come si distingue una faglia trascorrente da una normale?"
documents = [
    "Le faglie trascorrenti sono caratterizzate da movimento orizzontale...",
    "Le faglie normali si verificano a causa di stress estensionale...",
    "Le strategie di gestione del portafoglio di investimenti..."
]

query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
doc_embeddings = model.encode(documents, prompt="Represent this passage for retrieval: ")
similarities = model.similarity(query_embedding, doc_embeddings)
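
The similarity call returns one row of scores per query. A brief follow-up sketch (assuming the snippet above has just been run) sorts the documents by score:

import torch

# Rank documents by descending cosine similarity to the query
scores = similarities[0]
for idx in torch.argsort(scores, descending=True):
    print(f"{scores[idx].item():.4f}  {documents[int(idx)]}")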

Prompt Templates

The model is optimized for specific prompt templates:

  • Queries: "Represent this search query for finding relevant passages: "
  • Documents: "Represent this passage for retrieval: "
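
The prompt strings can also be registered by name when loading the model, so they do not have to be repeated at every encode call. This is a sketch based on the sentence-transformers prompts argument (available in recent versions); the names "query" and "passage" are chosen here purely for illustration.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "DeepMount00/Ita-Search",
    prompts={
        "query": "Represent this search query for finding relevant passages: ",
        "passage": "Represent this passage for retrieval: ",
    },
)

# Select the registered prompt by name at encode time
query_emb = model.encode("Quali sono i sintomi dell'influenza?", prompt_name="query")
doc_emb = model.encode("L'influenza provoca febbre, tosse e dolori muscolari.", prompt_name="passage")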

Applications

  • Italian information retrieval systems
  • Academic and technical document search in Italian
  • Italian question-answering platforms
  • Educational content recommendation for Italian speakers
  • Professional knowledge base systems in Italian

Limitations

  • Language coverage: Optimized specifically for the Italian language
  • Domain specificity: Performance may vary on highly specialized domains not represented in the training data

Acknowledgments

This work builds upon the Qwen3-Embedding architecture and advances in contrastive learning for dense retrieval. We acknowledge the contributions of the Qwen team and the sentence-transformers community.


License: Inherits licensing terms from the base Qwen/Qwen3-Embedding-0.6B model.
