Turkuaz-Embeddings

Overview

Turkuaz-Embeddings is a novel Turkish embedding model specifically optimized for information retrieval tasks. While multilingual embedding models aim to cover a wide range of languages, they often struggle with low-resource languages like Turkish. Turkuaz-Embeddings addresses these limitations, achieving significant improvements in retrieval performance across multiple Turkish benchmarks.

With the rise of Retrieval-Augmented Generation (RAG) systems, the need for reliable, language-specific embeddings has become critical. Turkuaz-Embeddings enhances retrieval reliability and accuracy, making it a valuable component for RAG pipelines and other downstream semantic search tasks involving Turkish content.


Highlights

  • Outperforms widely used multilingual models on certain Turkish benchmarks, and outperforms existing Turkish embedding models by up to 20% (9% on average) on Turkish retrieval benchmarks.
  • Improves on its base model, XLM-RoBERTa Large, by up to 35% (20% on average).
  • Demonstrates robust zero-shot retrieval capabilities, indicating strong generalization.
  • Specifically designed for Turkish, addressing the unique challenges of an agglutinative language.
  • Embedding Size: 1024

Model Training Process

Turkuaz-Embeddings is built on the XLM-RoBERTa architecture and was fine-tuned on the MSMARCO-TR dataset. The training pipeline involved the following steps:

  • Training Data:

    • Document pairs with reranker-generated scores
  • Reranker for Scoring:

    • Jina Reranker v2, a cross-encoder model, was used to generate more nuanced relevance scores for document pairs.
  • Loss Function:

    • Trained with CoSENT (Cosine Sentence) Loss from the Sentence Transformers library, optimizing cosine similarity between sentence embeddings.
  • Evaluation During Training:

    • Used the Embedding Similarity Evaluator from Sentence Transformers to assess embedding quality.
  • Hardware and Training Setup:

    • Trained on 2 million document pairs.
    • Parallel distributed training over 4x Nvidia A100 80GB GPUs.
    • Learning rate: 5e-5 (initial phase), 5e-6 (later phases).
    • Warm-up ratios: 0.1 initially, 0.01 for subsequent phases.
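
The CoSENT objective named in the list above can be sketched in plain Python. This is a dependency-free illustration, not the authors' training code: the function name `cosent_loss` and the toy scores are assumptions made for the example, and the scale of 20 mirrors the default of `CoSENTLoss` in Sentence Transformers. For every two document pairs where the reranker scored one higher than the other, the loss penalizes the model when its cosine similarities are ordered the other way.

```python
import math

def cosent_loss(cos_sims, labels, scale=20.0):
    """CoSENT loss for one batch of (cosine similarity, target score) pairs."""
    # Start with 0.0 so that exp(0) supplies the "1 +" inside the logarithm.
    logits = [0.0]
    for cos_i, label_i in zip(cos_sims, labels):
        for cos_j, label_j in zip(cos_sims, labels):
            # Pair j should score higher than pair i; penalize cos_i > cos_j.
            if label_i < label_j:
                logits.append(scale * (cos_i - cos_j))
    # Numerically stable log-sum-exp.
    peak = max(logits)
    return peak + math.log(sum(math.exp(x - peak) for x in logits))

# Hypothetical reranker scores for two document pairs (higher = more relevant).
reranker_scores = [0.95, 0.05]

# Model cosine similarities that agree with the reranker ordering ...
agreeing = cosent_loss([0.9, 0.1], reranker_scores)
# ... and similarities that rank the two pairs the wrong way round.
conflicting = cosent_loss([0.1, 0.9], reranker_scores)

print(f"agreeing: {agreeing:.6f}, conflicting: {conflicting:.6f}")
```

Only the relative order of the target scores enters the loss, which is one reason it pairs naturally with graded reranker scores rather than binary relevance labels.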

Evaluation Results

📊 Leveraged Models

| Model | Languages | Parameters | Base Model | Max. Tokens | Embedding Size | Latency |
|---|---|---|---|---|---|---|
| Jina Emb. | 100 | 572M | XLM RoBERTa | 8192 | 1024 | 0.48s |
| E5 Ins. | 100 | 560M | XLM RoBERTa | 512 | 1024 | 0.11s |
| Paraph. MPNET | 50 | 278M | XLM RoBERTa | 128 | 768 | 0.02s |
| Distiluse | 50 | 135M | DistilBERT | 128 | 512 | 0.01s |
| XLM RoBERTa | 100 | 560M | RoBERTa | 512 | 1024 | 0.11s |
| LaBSE | 109 | 471M | BERT | 256 | 768 | 0.02s |
| Proposed Model | 1 | 560M | XLM RoBERTa | 512 | 1024 | 0.11s |
| BERTurk | 1 | 110M | BERT | 512 | 768 | 0.03s |
| BERT NLI TR | 1 | 110M | BERT | 512 | 768 | 0.01s |
| Turkish ColBERT | 1 | 111M | BERT | 512 | 768 | 0.04s |

📊 Leveraged Datasets

| Dataset | Sample Size | NLP Task |
|---|---|---|
| MSMARCO-TR | 1M | Question Answering |
| Wiki-RAG TR | 6000 | RAG |
| Neural Bridge TR | 12000 | RAG |
| Turkish Historic Question | 15500 | Question Answering |
| Google XQuAD – TR | 1200 | Question Answering |
| SQuAD-TR | 61300 | Question Answering |

📊 Retrieval Performance Comparison with Turkish Models

| Dataset (Sample Size) | Retrieval Model | Precision | Recall |
|---|---|---|---|
| WikiRAG TR (6000) | BERTurk | 0.1765 | 0.1764 |
| | BERT Base NLI TR | 0.2149 | 0.2145 |
| | Proposed Model | 0.2580 | 0.2577 |
| | ColBERT | 0.1807 | 0.1805 |
| Neural Bridge TR (9600) | BERTurk | 0.4626 | 0.4626 |
| | BERT Base NLI TR | 0.6014 | 0.6014 |
| | Proposed Model | 0.8286 | 0.8286 |
| | ColBERT | 0.5848 | 0.5848 |
| TR Historic Question (15.5k) | BERTurk | 0.4384 | 0.4384 |
| | BERT Base NLI TR | 0.5121 | 0.5121 |
| | Proposed Model | 0.5873 | 0.5873 |
| | ColBERT | 0.4656 | 0.4656 |
| Google XQuAD TR (1.2k) | BERTurk | 0.6387 | 0.6387 |
| | BERT Base NLI TR | 0.7109 | 0.7109 |
| | Proposed Model | 0.7807 | 0.7807 |
| | ColBERT | 0.7361 | 0.7361 |
| SQuAD-TR (5000) | BERTurk | 0.2834 | 0.2834 |
| | BERT Base NLI TR | 0.3302 | 0.3302 |
| | Proposed Model | 0.4140 | 0.4140 |
| | ColBERT | 0.3334 | 0.3334 |

📊 Table 7: Retrieval Performance Comparison with Multilingual Models

| Dataset (Sample Size) | Retrieval Model | Precision | Recall |
|---|---|---|---|
| MSMARCO TR (2000) | XLM RoBERTa | 0.0380 | 0.0380 |
| | E5 Instruct | 0.1531 | 0.1526 |
| | Jina Embedding | 0.1618 | 0.1618 |
| | Proposed Model | 0.1722 | 0.1722 |
| WikiRAG TR (6000) | E5 Instruct | 0.2814 | 0.2812 |
| | Jina Embedding | 0.2694 | 0.2692 |
| | Proposed Model | 0.2580 | 0.2577 |
| | Paraphrase MPNET | 0.2380 | 0.2377 |
| | Distiluse Base | 0.2125 | 0.2124 |
| | LaBSE | 0.1994 | 0.1990 |
| | XLM RoBERTa | 0.1544 | 0.1542 |
| Neural Bridge TR (9600) | E5 Instruct | 0.8085 | 0.8085 |
| | Jina Embedding | 0.7969 | 0.7969 |
| | Proposed Model | 0.8286 | 0.8286 |
| | Paraphrase MPNET | 0.6903 | 0.6903 |
| | Distiluse Base | 0.6179 | 0.6179 |
| | LaBSE | 0.6593 | 0.6593 |
| | XLM RoBERTa | 0.4518 | 0.4518 |
| TR Historic Question (15.5k) | E5 Instruct | 0.6344 | 0.6344 |
| | Jina Embedding | 0.6188 | 0.6188 |
| | Proposed Model | 0.5873 | 0.5873 |
| | Paraphrase MPNET | 0.4739 | 0.4739 |
| | Distiluse Base | 0.4707 | 0.4707 |
| | LaBSE | 0.4667 | 0.4667 |
| | XLM RoBERTa | 0.3639 | 0.3639 |
| Google XQuAD TR (1.2k) | E5 Instruct | 0.8437 | 0.8437 |
| | Jina Embedding | 0.8160 | 0.8160 |
| | Proposed Model | 0.7807 | 0.7807 |
| | Paraphrase MPNET | 0.7571 | 0.7571 |
| | Distiluse Base | 0.7277 | 0.7277 |
| | LaBSE | 0.7538 | 0.7538 |
| | XLM RoBERTa | 0.3975 | 0.3975 |
| SQuAD-TR (5000) | E5 Instruct | 0.5082 | 0.5082 |
| | Jina Embedding | 0.4706 | 0.4706 |
| | Proposed Model | 0.4140 | 0.4140 |
| | Paraphrase MPNET | 0.3820 | 0.3820 |
| | Distiluse Base | 0.3620 | 0.3620 |
| | LaBSE | 0.3594 | 0.3594 |
| | XLM RoBERTa | 0.2078 | 0.2078 |
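
As a worked reading of the table above, the following sketch computes the absolute precision gain of the proposed model over its XLM RoBERTa base on each dataset. The numbers are copied from the table; the variable names are illustrative. The card does not state how the headline percentages in the Highlights were derived, so these point differences are one plausible reading and are not expected to reproduce those figures exactly.

```python
# Precision values copied from the multilingual comparison table.
precision = {
    "MSMARCO TR": {"XLM RoBERTa": 0.0380, "Proposed Model": 0.1722},
    "WikiRAG TR": {"XLM RoBERTa": 0.1544, "Proposed Model": 0.2580},
    "Neural Bridge TR": {"XLM RoBERTa": 0.4518, "Proposed Model": 0.8286},
    "TR Historic Question": {"XLM RoBERTa": 0.3639, "Proposed Model": 0.5873},
    "Google XQuAD TR": {"XLM RoBERTa": 0.3975, "Proposed Model": 0.7807},
    "SQuAD-TR": {"XLM RoBERTa": 0.2078, "Proposed Model": 0.4140},
}

# Absolute gain (difference in precision) per dataset.
gains = {
    name: row["Proposed Model"] - row["XLM RoBERTa"]
    for name, row in precision.items()
}

largest = max(gains, key=gains.get)            # dataset with the biggest gain
mean_gain = sum(gains.values()) / len(gains)   # unweighted mean over datasets

for name, gain in gains.items():
    print(f"{name}: +{gain * 100:.1f} points")
print(f"largest gain on {largest}; mean gain {mean_gain * 100:.1f} points")
```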


Intended Use

  • Semantic Search
  • Retrieval-Augmented Generation (RAG) Systems
  • Zero-shot Retrieval
  • Turkish NLP Research and Applications

Sample Usage

import torch
from sentence_transformers import SentenceTransformer

# Load the model on the GPU when one is available, otherwise on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("eneSadi/turkuaz-embeddings", device=device)

# Encode sentences into 1024-dimensional embeddings
sentences = [
    "Ben bir ceviz ağacıyım Gülhane parkında.",
    "Herkes gömlek giyerken, Ahmet ceket giyerdi.",
]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 1024)
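
For semantic search, embeddings like these are typically compared with cosine similarity. The sketch below shows that ranking step in plain Python on toy 3-dimensional vectors so that it runs without downloading the model. The helper names `cosine_similarity` and `rank_documents` are illustrative assumptions, not part of any library; in practice the vectors would come from `model.encode(...)`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (score, document index) tuples, most similar first."""
    scored = [
        (cosine_similarity(query_vec, doc_vec), index)
        for index, doc_vec in enumerate(doc_vecs)
    ]
    return sorted(scored, reverse=True)

# Toy stand-ins for real embeddings (which would have 1024 dimensions).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_documents(query_vec, doc_vecs)
for score, index in ranking:
    print(f"document {index}: cosine similarity {score:.3f}")
```

Because cosine similarity ignores vector length, the second document ranks first even though its magnitude differs from the query's.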

Keywords

Embedding Models, Information Retrieval, Semantic Search, Retrieval-Augmented Generation (RAG), Turkish NLP


Citation

If you use Turkuaz-Embeddings in your work, please consider citing it appropriately (citation format to be provided later).


License

Apache 2.0


Acknowledgements

This work leveraged the MSMARCO-TR dataset and builds upon the Sentence Transformers library and Jina Reranker v2 model.
