Turkuaz-Embeddings

Overview
Turkuaz-Embeddings is a novel Turkish embedding model specifically optimized for information retrieval tasks. While multilingual embedding models aim to cover a wide range of languages, they often struggle with low-resource languages like Turkish. Turkuaz-Embeddings addresses these limitations, achieving significant improvements in retrieval performance across multiple Turkish benchmarks.
With the rise of Retrieval-Augmented Generation (RAG) systems, the need for reliable, language-specific embeddings has become critical. Turkuaz-Embeddings enhances retrieval reliability and accuracy, making it a valuable component for RAG pipelines and other downstream semantic search tasks involving Turkish content.
Highlights
- Outperforms widely used multilingual models on certain Turkish benchmarks, and outperforms Turkish embedding models by up to 20% (9% on average) on Turkish retrieval benchmarks.
- Improves on its base model, XLM-RoBERTa Large, by up to 35% (20% on average).
- Demonstrates robust zero-shot retrieval capabilities, indicating strong generalization.
- Specifically designed for Turkish, addressing the unique challenges of an agglutinative language.
- Embedding Size: 1024
Model Training Process
Turkuaz-Embeddings is built on the XLM-RoBERTa architecture and was fine-tuned on the MSMARCO-TR dataset. The training pipeline involved the following steps:
Training Data:
- Document pairs with reranker-generated scores
Reranker for Scoring:
- Jina Reranker v2, a cross-encoder model, was used to generate more nuanced relevance scores for document pairs.
Loss Function:
- Trained with CoSENT (Cosine Sentence) Loss from the Sentence Transformers library, optimizing cosine similarity between sentence embeddings.
Evaluation During Training:
- Used the Embedding Similarity Evaluator from Sentence Transformers to assess embedding quality.
Hardware and Training Setup:
- Trained on 2 million document pairs.
- Parallel distributed training over 4x Nvidia A100 80GB GPUs.
- Learning rate: 5e-5 (initial phase), 5e-6 (later phases).
- Warm-up ratios: 0.1 initially, 0.01 for subsequent phases.
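To make the loss concrete, the following is a minimal pure-Python sketch of the CoSENT objective as documented for Sentence Transformers: for every two training pairs where the first has the higher reranker score, the loss penalizes the model if the first pair's cosine similarity is not also higher. It is an illustration, not the training code used for this model. The function names `cosine` and `cosent_loss` are hypothetical, and `scale=20.0` is an assumed value rather than a setting reported above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosent_loss(pairs, scores, scale=20.0):
    """CoSENT-style ranking loss over a batch (illustrative sketch).

    pairs  : list of (embedding_a, embedding_b) tuples
    scores : one relevance score per pair (e.g. from a reranker)
    scale  : assumed temperature-like factor, not taken from the model card
    """
    sims = [cosine(a, b) for a, b in pairs]

    # The leading 0.0 is the constant 1 inside the log, since exp(0) == 1.
    terms = [0.0]
    for i in range(len(pairs)):
        for j in range(len(pairs)):
            # Pair i is labelled more relevant than pair j, so we want
            # sims[i] > sims[j]; the term grows when that ordering is violated.
            if scores[i] > scores[j]:
                terms.append(scale * (sims[j] - sims[i]))

    # Numerically stable log-sum-exp over all terms.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# Toy batch: the first pair is identical vectors, the second is orthogonal.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
loss_agree = cosent_loss(toy_pairs, scores=[0.9, 0.1])     # ordering matches
loss_disagree = cosent_loss(toy_pairs, scores=[0.1, 0.9])  # ordering violated
```

Because only the relative order of the scores matters, graded reranker scores can be used directly as labels without thresholding them into relevant and non-relevant classes.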
Evaluation Results
Leveraged Models
Model | #Lang | #Param | Base Model | Max. Token | Embed. Size | Latency |
---|---|---|---|---|---|---|
Jina Emb. | 100 | 572M | XLM RoBERTa | 8192 | 1024 | 0.48s |
E5 Ins. | 100 | 560M | XLM RoBERTa | 512 | 1024 | 0.11s |
Paraph. MPNET | 50 | 278M | XLM RoBERTa | 128 | 768 | 0.02s |
Distiluse | 50 | 135M | DistilBERT | 128 | 512 | 0.01s |
XLM RoBERTa | 100 | 560M | RoBERTa | 512 | 1024 | 0.11s |
LaBSE | 109 | 471M | BERT | 256 | 768 | 0.02s |
Proposed Model | 1 | 560M | XLM RoBERTa | 512 | 1024 | 0.11s |
BERTurk | 1 | 110M | BERT | 512 | 768 | 0.03s |
BERT NLI TR | 1 | 110M | BERT | 512 | 768 | 0.01s |
Turkish ColBERT | 1 | 111M | BERT | 512 | 768 | 0.04s |
Leveraged Datasets
Dataset | Sample Size | NLP Task |
---|---|---|
MSMARCO-TR | 1M | Question Answering |
Wiki-RAG TR | 6000 | RAG |
Neural Bridge TR | 12000 | RAG |
Turkish Historic Question | 15500 | Question Answering |
Google XQuAD TR | 1200 | Question Answering |
SQuAD-TR | 61300 | Question Answering |
Retrieval Performance Comparison with Turkish Models
Dataset (Sample Size) | Model | Precision | Recall |
---|---|---|---|
WikiRAG TR (6000) | BERTurk | 0.1765 | 0.1764 |
WikiRAG TR (6000) | BERT Base NLI TR | 0.2149 | 0.2145 |
WikiRAG TR (6000) | Proposed Model | 0.2580 | 0.2577 |
WikiRAG TR (6000) | ColBERT | 0.1807 | 0.1805 |
Neural Bridge TR (9600) | BERTurk | 0.4626 | 0.4626 |
Neural Bridge TR (9600) | BERT Base NLI TR | 0.6014 | 0.6014 |
Neural Bridge TR (9600) | Proposed Model | 0.8286 | 0.8286 |
Neural Bridge TR (9600) | ColBERT | 0.5848 | 0.5848 |
TR Historic Question (15.5k) | BERTurk | 0.4384 | 0.4384 |
TR Historic Question (15.5k) | BERT Base NLI TR | 0.5121 | 0.5121 |
TR Historic Question (15.5k) | Proposed Model | 0.5873 | 0.5873 |
TR Historic Question (15.5k) | ColBERT | 0.4656 | 0.4656 |
Google XQuAD TR (1.2k) | BERTurk | 0.6387 | 0.6387 |
Google XQuAD TR (1.2k) | BERT Base NLI TR | 0.7109 | 0.7109 |
Google XQuAD TR (1.2k) | Proposed Model | 0.7807 | 0.7807 |
Google XQuAD TR (1.2k) | ColBERT | 0.7361 | 0.7361 |
SQuAD-TR (5000) | BERTurk | 0.2834 | 0.2834 |
SQuAD-TR (5000) | BERT Base NLI TR | 0.3302 | 0.3302 |
SQuAD-TR (5000) | Proposed Model | 0.4140 | 0.4140 |
SQuAD-TR (5000) | ColBERT | 0.3334 | 0.3334 |
Table 7: Retrieval Performance Comparison with Multilingual Models
Dataset (Sample Size) | Model | Precision | Recall |
---|---|---|---|
MSMARCO TR (2000) | XLM RoBERTa | 0.0380 | 0.0380 |
MSMARCO TR (2000) | E5 Instruct | 0.1531 | 0.1526 |
MSMARCO TR (2000) | Jina Embedding | 0.1618 | 0.1618 |
MSMARCO TR (2000) | Proposed Model | 0.1722 | 0.1722 |
WikiRAG TR (6000) | E5 Instruct | 0.2814 | 0.2812 |
WikiRAG TR (6000) | Jina Embedding | 0.2694 | 0.2692 |
WikiRAG TR (6000) | Proposed Model | 0.2580 | 0.2577 |
WikiRAG TR (6000) | Paraphrase MPNET | 0.2380 | 0.2377 |
WikiRAG TR (6000) | Distiluse Base | 0.2125 | 0.2124 |
WikiRAG TR (6000) | LaBSE | 0.1994 | 0.1990 |
WikiRAG TR (6000) | XLM RoBERTa | 0.1544 | 0.1542 |
Neural Bridge TR (9600) | E5 Instruct | 0.8085 | 0.8085 |
Neural Bridge TR (9600) | Jina Embedding | 0.7969 | 0.7969 |
Neural Bridge TR (9600) | Proposed Model | 0.8286 | 0.8286 |
Neural Bridge TR (9600) | Paraphrase MPNET | 0.6903 | 0.6903 |
Neural Bridge TR (9600) | Distiluse Base | 0.6179 | 0.6179 |
Neural Bridge TR (9600) | LaBSE | 0.6593 | 0.6593 |
Neural Bridge TR (9600) | XLM RoBERTa | 0.4518 | 0.4518 |
TR Historic Question (15.5k) | E5 Instruct | 0.6344 | 0.6344 |
TR Historic Question (15.5k) | Jina Embedding | 0.6188 | 0.6188 |
TR Historic Question (15.5k) | Proposed Model | 0.5873 | 0.5873 |
TR Historic Question (15.5k) | Paraphrase MPNET | 0.4739 | 0.4739 |
TR Historic Question (15.5k) | Distiluse Base | 0.4707 | 0.4707 |
TR Historic Question (15.5k) | LaBSE | 0.4667 | 0.4667 |
TR Historic Question (15.5k) | XLM RoBERTa | 0.3639 | 0.3639 |
Google XQuAD TR (1.2k) | E5 Instruct | 0.8437 | 0.8437 |
Google XQuAD TR (1.2k) | Jina Embedding | 0.8160 | 0.8160 |
Google XQuAD TR (1.2k) | Proposed Model | 0.7807 | 0.7807 |
Google XQuAD TR (1.2k) | Paraphrase MPNET | 0.7571 | 0.7571 |
Google XQuAD TR (1.2k) | Distiluse Base | 0.7277 | 0.7277 |
Google XQuAD TR (1.2k) | LaBSE | 0.7538 | 0.7538 |
Google XQuAD TR (1.2k) | XLM RoBERTa | 0.3975 | 0.3975 |
SQuAD-TR (5000) | E5 Instruct | 0.5082 | 0.5082 |
SQuAD-TR (5000) | Jina Embedding | 0.4706 | 0.4706 |
SQuAD-TR (5000) | Proposed Model | 0.4140 | 0.4140 |
SQuAD-TR (5000) | Paraphrase MPNET | 0.3820 | 0.3820 |
SQuAD-TR (5000) | Distiluse Base | 0.3620 | 0.3620 |
SQuAD-TR (5000) | LaBSE | 0.3594 | 0.3594 |
SQuAD-TR (5000) | XLM RoBERTa | 0.2078 | 0.2078 |
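The model card does not state the cutoff or the exact protocol behind the precision and recall columns, so the following is only a sketch of how such retrieval metrics are commonly computed. The helper names `precision_recall_at_k` and `mean_precision_recall`, the document IDs, and the choice of `k` are all hypothetical.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for a single query.

    retrieved : ranked list of document IDs returned by the retriever
    relevant  : set of document IDs judged relevant for the query
    k         : cutoff rank (an assumption; not given in the tables above)
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall(runs, k):
    """Average precision@k and recall@k over (retrieved, relevant) runs."""
    results = [precision_recall_at_k(ret, rel, k) for ret, rel in runs]
    mean_p = sum(p for p, _ in results) / len(results)
    mean_r = sum(r for _, r in results) / len(results)
    return mean_p, mean_r


# Two toy queries, each with exactly one relevant document.
toy_runs = [
    (["d3", "d1"], {"d1"}),  # relevant document ranked second
    (["d2", "d9"], {"d2"}),  # relevant document ranked first
]
mean_p_at_1, mean_r_at_1 = mean_precision_recall(toy_runs, k=1)
```

When every query has a single relevant document and `k` is 1, precision and recall coincide, which is one possible reading of the near-identical columns in the tables above.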
Intended Use
- Semantic Search
- Retrieval-Augmented Generation (RAG) Systems
- Zero-shot Retrieval
- Turkish NLP Research and Applications
Sample Usage
from sentence_transformers import SentenceTransformer
# Load the model
device = "cuda" # or "cpu"
model = SentenceTransformer("eneSadi/turkuaz-embeddings", device=device)
# Encode sentences
sentences = ["Ben bir ceviz ağacıyım Gülhane parkında.", "Herkes gömlek giyerken, Ahmet ceket giyerdi."]
embeddings = model.encode(sentences)
print(embeddings.shape) # (2, 1024)
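For semantic search, the encoded vectors are typically compared with cosine similarity and the documents sorted by score. The sketch below shows that ranking step in plain Python on toy 3-dimensional vectors, which stand in for the 1024-dimensional embeddings returned by `model.encode`. The helper names `cosine_similarity` and `rank_documents` are hypothetical and not part of any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (document_index, score) pairs, best match first."""
    scored = [
        (idx, cosine_similarity(query_vec, vec))
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors; in practice these would come from model.encode(...).
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
top_matches = rank_documents(query_vec, doc_vecs, top_k=2)
```

For large collections, a vector index would normally replace this exhaustive loop, but the scoring logic stays the same.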
Keywords
Embedding Models, Information Retrieval, Semantic Search, Retrieval-Augmented Generation (RAG), Turkish NLP
Citation
If you use Turkuaz-Embeddings in your work, please consider citing it appropriately (citation format to be provided later).
License
Apache 2.0
Acknowledgements
This work leveraged the MSMARCO-TR dataset and builds upon the Sentence Transformers library and Jina Reranker v2 model.
Model tree for eneSadi/turkuaz-embeddings
Base model
FacebookAI/xlm-roberta-large