---
tags:
- sentence-transformers
- sentence-similarity
- information-retrieval
- semantic-search
widget:
- source_sentence: >-
    Descrivi dettagliatamente il processo chimico e fisico che avviene durante
    la preparazione di un impasto per crostata
  sentences:
  - >-
    ## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio Dagli
    Ingredienti Secchi al Trionfo del Forno


    La preparazione di una crostata, apparentemente un gesto semplice e
    familiare, cela in realtà un affascinante balletto di reazioni chimiche e
    trasformazioni fisiche...
  - >-
    ## L'Arte Effimera: Creare un Dolce Paesaggio Invernale


    Immergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte
    culinaria si fonde con la creatività artistica...
  - >-
    Le piattaforme di comunicazione digitale, con la loro ubiquità crescente, si
    configurano come un'arma a doppio taglio nel panorama sociale
    contemporaneo...
pipeline_tag: sentence-similarity
library_name: sentence-transformers
language:
- it
license: apache-2.0
---

<p align="center">
<img src="benchmark.png" style="max-width: 1024px; width: 100%; height: auto;"/>
</p>
<h1 style="font-size: 48px; text-align: center;">Ita-Search 🇮🇹</h1>

# Fine-tuned Qwen3-Embedding for Italian Semantic Retrieval

This model is a fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) optimized for Italian semantic retrieval, with particular emphasis on understanding Italian queries and ranking Italian documents.

## Model Description

- **Model Type**: Dense embedding model for semantic retrieval
- **Base Model**: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- **Output Dimensionality**: 1,024-dimensional dense vectors
- **Maximum Sequence Length**: 32,768 tokens
- **Primary Language**: Italian
- **Similarity Function**: Cosine similarity
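
To make the similarity function concrete, here is a minimal pure-Python sketch of cosine similarity. The helper name `cosine_similarity` and the toy 4-dimensional vectors are illustrative assumptions standing in for the model's 1,024-dimensional embeddings; in practice the library computes this for you.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings (assumed values, not model output).
query_vec = [1.0, 0.0, 1.0, 0.0]
close_doc = [2.0, 0.0, 2.0, 0.0]  # same direction as the query, different length
far_doc = [0.0, 3.0, 0.0, 3.0]    # orthogonal to the query

print(cosine_similarity(query_vec, close_doc))  # close to 1: same direction
print(cosine_similarity(query_vec, far_doc))    # 0: no overlap in direction
```

Because the score depends only on direction, scaling a vector does not change its similarity to the query.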

## Capabilities

### Italian Semantic Retrieval
The model performs well at matching Italian queries to Italian documents and is particularly effective in technical and academic domains.

### Domain Coverage
The training data spans diverse Italian knowledge domains, including:
- **Medical & Health Sciences**: Diagnostic imaging, clinical procedures, medical terminology
- **STEM Fields**: Physics, computer science, geology, engineering
- **Professional Domains**: Finance, law, agriculture, software development
- **Educational Content**: Historical studies, culinary arts, general knowledge

### Query Understanding
Fine-tuning targets improved handling of:
- Conversational and informal Italian query patterns
- Technical terminology in Italian across domains
- Italian semantic concepts and nuances
- Complex, multi-faceted questions in Italian

## Training Data

The model was fine-tuned on a curated corpus of Italian text, organized as high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:

- **Hard negative mining**: Strategic inclusion of semantically related but incorrect documents
- **Italian language focus**: Broad representation of Italian language patterns
- **Domain diversity**: Coverage of academic, professional, and conversational contexts in Italian
- **Quality curation**: Manual review and automated filtering for coherence and relevance
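
The following sketch illustrates why hard negatives matter. It is not the training code: this card does not state the loss that was used, so the triplet margin objective, the `Triplet` record layout, the margin value, the example sentences, and the 2-dimensional vectors are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Triplet:
    """Hypothetical training record: a query, a relevant document, a hard negative."""
    query: str
    positive: str
    hard_negative: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_margin_loss(q, pos, neg, margin: float = 0.2) -> float:
    """Zero once the positive beats the negative by at least `margin` in cosine similarity."""
    return max(0.0, margin - (cosine(q, pos) - cosine(q, neg)))


# Illustrative record (made up for this sketch, not taken from the training set).
example = Triplet(
    query="Quali sono i sintomi dell'influenza?",
    positive="I sintomi dell'influenza includono febbre, tosse e dolori muscolari.",
    hard_negative="Il vaccino antinfluenzale viene somministrato ogni anno in autunno.",
)

# Toy embeddings: an easy negative points elsewhere, a hard negative sits near the query.
q, pos = [1.0, 0.0], [1.0, 0.0]
easy_neg, hard_neg = [0.0, 1.0], [1.0, 0.1]

print(triplet_margin_loss(q, pos, easy_neg))  # no training signal left
print(triplet_margin_loss(q, pos, hard_neg))  # positive loss: still something to learn
```

Under this assumed objective, an easy negative contributes nothing once it is far from the query, while a semantically related negative keeps producing a gradient.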

## Usage

### Basic Retrieval
```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("DeepMount00/Ita-Search")

# Italian query-document matching
query = "Come si distingue una faglia trascorrente da una normale?"
documents = [
    "Le faglie trascorrenti sono caratterizzate da movimento orizzontale...",
    "Le faglie normali si verificano a causa di stress estensionale...",
    "Le strategie di gestione del portafoglio di investimenti..."
]

# Encode the query and the documents, each with its own prompt
query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
doc_embeddings = model.encode(documents, prompt="Represent this passage for retrieval: ")

# Score every document against the query
similarities = model.similarity(query_embedding, doc_embeddings)
print(similarities)
```
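
Turning similarity scores into a ranked result list is a small extra step. The sketch below uses a hypothetical helper, `rank_documents`, and placeholder scores that are not real model output, so it runs without downloading the model.

```python
from typing import List, Sequence, Tuple


def rank_documents(
    scores: Sequence[float], documents: Sequence[str], top_k: int = 3
) -> List[Tuple[float, str]]:
    """Return the top_k (score, document) pairs, highest score first."""
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]


# Placeholder values for illustration only.
example_scores = [0.64, 0.71, 0.12]
example_documents = ["documento A", "documento B", "documento C"]

for score, doc in rank_documents(example_scores, example_documents, top_k=2):
    print(f"{score:.2f}  {doc}")
```

With the objects from the snippet above, the call would be `rank_documents(similarities[0].tolist(), documents)`, since `model.similarity` returns one row of scores per query.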

### Prompt Templates
The model is optimized for the following prompt prefixes, passed through the `prompt` argument of `encode`:
- **Queries**: `"Represent this search query for finding relevant passages: "`
- **Documents**: `"Represent this passage for retrieval: "`
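
In sentence-transformers, the `prompt` argument is prepended to each input text before tokenization. The sketch below reproduces that string-level effect with a hypothetical helper, `apply_prompt`, which can be handy when embeddings are computed by a service that only accepts raw text.

```python
from typing import List

QUERY_PROMPT = "Represent this search query for finding relevant passages: "
DOCUMENT_PROMPT = "Represent this passage for retrieval: "


def apply_prompt(texts: List[str], prompt: str) -> List[str]:
    """Prepend the given prompt prefix to every text."""
    return [prompt + text for text in texts]


prompted_queries = apply_prompt(
    ["Come si distingue una faglia trascorrente da una normale?"], QUERY_PROMPT
)
print(prompted_queries[0])
```

Alternatively, the two prefixes can be registered once through the `prompts` argument of the `SentenceTransformer` constructor and then selected with `prompt_name` in `encode`, which avoids repeating the strings at every call site.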

## Applications

- **Italian information retrieval systems**
- **Academic and technical document search in Italian**
- **Italian question-answering platforms**
- **Educational content recommendation for Italian speakers**
- **Professional knowledge base systems in Italian**

## Limitations

- **Language coverage**: Optimized specifically for Italian; retrieval quality on other languages may be lower
- **Domain specificity**: Performance may vary on highly specialized domains not represented in the training data

## Acknowledgments

This work builds upon the Qwen3-Embedding architecture and advances in contrastive learning for dense retrieval. We acknowledge the contributions of the Qwen team and the sentence-transformers community.

---

**License**: Apache 2.0, inherited from the base Qwen/Qwen3-Embedding-0.6B model.