---
tags:
- sentence-transformers
- sentence-similarity
- information-retrieval
- semantic-search
widget:
- source_sentence: >-
    Descrivi dettagliatamente il processo chimico e fisico che avviene durante
    la preparazione di un impasto per crostata
  sentences:
  - >-
    ## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio Dagli
    Ingredienti Secchi al Trionfo del Forno
    La preparazione di una crostata, apparentemente un gesto semplice e
    familiare, cela in realtà un affascinante balletto di reazioni chimiche e
    trasformazioni fisiche...
  - >-
    ## L'Arte Effimera: Creare un Dolce Paesaggio Invernale
    Immergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte
    culinaria si fonde con la creatività artistica...
  - >-
    Le piattaforme di comunicazione digitale, con la loro ubiquità crescente, si
    configurano come un'arma a doppio taglio nel panorama sociale
    contemporaneo...
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- it
license: apache-2.0
---
<p align="center">
<img src="benchmark.png" style="max-width: 1024px; width: 100%; height: auto;"/>
</p>
<h1 style="font-size: 48px; text-align: center;">Ita-Search 🇮🇹</h1>
# Fine-tuned Qwen3-Embedding for Italian Semantic Retrieval
This model is a fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B), optimized for Italian semantic retrieval with particular emphasis on Italian query understanding and document ranking.
## Model Description
- **Model Type**: Dense embedding model for semantic retrieval
- **Base Model**: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- **Output Dimensionality**: 1,024-dimensional dense vectors
- **Maximum Sequence Length**: 32,768 tokens
- **Primary Language**: Italian
- **Similarity Function**: Cosine similarity
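For reference, the cosine similarity listed above can be sketched in pure Python. The function name `cosine_similarity` and the toy 3-dimensional vectors are illustrative assumptions; real embeddings from this model have 1,024 dimensions and would normally be compared with a vectorized library call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (hypothetical values).
query_vec = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction, different length
doc_orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine_similarity(query_vec, doc_same_direction))  # close to 1
print(cosine_similarity(query_vec, doc_orthogonal))      # 0.0
```

Because the score depends only on direction, scaling a vector does not change its similarity to the query.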
## Capabilities
### Italian Semantic Retrieval
The model performs strongly at matching Italian queries to Italian documents and is particularly effective in technical and academic domains.
### Domain Coverage
Trained on diverse Italian knowledge domains including:
- **Medical & Health Sciences**: Diagnostic imaging, clinical procedures, medical terminology
- **STEM Fields**: Physics, computer science, geology, engineering
- **Professional Domains**: Finance, law, agriculture, software development
- **Educational Content**: Historical studies, culinary arts, general knowledge
### Query Understanding
Enhanced comprehension of:
- Conversational and informal Italian query patterns
- Technical terminology in Italian across domains
- Italian semantic concepts and nuances
- Complex multi-faceted questions in Italian
## Training Data
The model was fine-tuned on a curated corpus of Italian semantic data, featuring high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:
- **Hard negative mining**: Strategic inclusion of semantically related but incorrect documents
- **Italian language focus**: Broad representation of Italian language patterns
- **Domain diversity**: Coverage of academic, professional, and conversational contexts in Italian
- **Quality curation**: Manual review and automated filtering for coherence and relevance
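This card does not state which loss function was used for fine-tuning. To illustrate how a (query, positive, hard negative) triplet can provide a training signal, the sketch below computes a margin-based triplet loss on cosine similarity. The loss choice, the `margin` value, and the toy 2-dimensional vectors are all assumptions made for illustration, not the author's training recipe.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """max(0, margin - sim(anchor, positive) + sim(anchor, negative))."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Hypothetical embeddings: the anchor is the query.
anchor = [1.0, 0.0]
positive = [1.0, 0.05]       # relevant document, very close to the query
easy_negative = [0.0, 1.0]   # unrelated document, orthogonal to the query
hard_negative = [1.0, 0.1]   # related but incorrect document, close to the query

print(triplet_margin_loss(anchor, positive, easy_negative))  # 0.0, nothing to learn
print(triplet_margin_loss(anchor, positive, hard_negative))  # positive, still informative
```

Under this assumed loss, an easy negative already satisfies the margin and contributes nothing, while a hard negative keeps the loss above zero. That is the usual motivation for mining hard negatives.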
## Usage
### Basic Retrieval
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DeepMount00/Ita-Search")

# Italian query-document matching
query = "Come si distingue una faglia trascorrente da una normale?"
documents = [
    "Le faglie trascorrenti sono caratterizzate da movimento orizzontale...",
    "Le faglie normali si verificano a causa di stress estensionale...",
    "Le strategie di gestione del portafoglio di investimenti...",
]

# Encode the query and the documents with their respective prompts
query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
doc_embeddings = model.encode(documents, prompt="Represent this passage for retrieval: ")

# Similarity scores between the query and each document (one score per document)
similarities = model.similarity(query_embedding, doc_embeddings)
print(similarities)
```
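Once similarity scores are available, retrieval reduces to sorting documents by score. The helper below is a minimal sketch of that step. `rank_documents` is a hypothetical name, and the scores are made-up placeholders rather than real model output.

```python
def rank_documents(scores, documents, top_k=None):
    """Return (document, score) pairs sorted by descending score."""
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

# Placeholder labels and scores (hypothetical). With the model loaded,
# `similarities[0].tolist()` would give one score per document instead.
documents = ["faglie trascorrenti", "faglie normali", "portafoglio di investimenti"]
scores = [0.71, 0.64, 0.12]

for doc, score in rank_documents(scores, documents, top_k=2):
    print(f"{score:.2f}  {doc}")
```

For large collections, a vector index is the usual replacement for a full sort, but the ranking logic stays the same.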
### Prompt Templates
The model is optimized for specific prompt templates:
- **Queries**: `"Represent this search query for finding relevant passages: "`
- **Documents**: `"Represent this passage for retrieval: "`
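To keep the two templates from drifting apart across a codebase, they can be stored in one place. The sketch below does this with a small dictionary and a helper. `PROMPTS` and `apply_prompt` are hypothetical names, and the helper only builds the prefixed strings without calling the model.

```python
# Prompt strings copied from the list above.
PROMPTS = {
    "query": "Represent this search query for finding relevant passages: ",
    "document": "Represent this passage for retrieval: ",
}

def apply_prompt(texts, kind):
    """Prepend the prompt for `kind` ("query" or "document") to each text."""
    if kind not in PROMPTS:
        raise KeyError(f"unknown prompt kind: {kind!r}")
    return [PROMPTS[kind] + text for text in texts]

# Do not combine this helper with `prompt=` in `encode`:
# the prefix would be added twice.
print(apply_prompt(["Come si forma una faglia?"], "query")[0])
```

With sentence-transformers, the same dictionary can instead be passed as `prompts=PROMPTS` when constructing `SentenceTransformer`, after which `encode(..., prompt_name="query")` selects a template by name.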
## Applications
- **Italian information retrieval systems**
- **Academic and technical document search in Italian**
- **Italian question-answering platforms**
- **Educational content recommendation for Italian speakers**
- **Professional knowledge base systems in Italian**
## Limitations
- **Language coverage**: Optimized specifically for the Italian language
- **Domain specificity**: Performance may vary on highly specialized domains not represented in training
## Acknowledgments
This work builds upon the Qwen3-Embedding architecture and advances in contrastive learning for dense retrieval. We acknowledge the contributions of the Qwen team and the sentence-transformers community.
---
**License**: Apache 2.0, inheriting the licensing terms of the base Qwen/Qwen3-Embedding-0.6B model.