Turkuaz-Embeddings

Overview

Turkuaz-Embeddings is a novel Turkish embedding model specifically optimized for information retrieval tasks. While multilingual embedding models aim to cover a wide range of languages, they often struggle with low-resource languages like Turkish. Turkuaz-Embeddings addresses these limitations, achieving significant improvements in retrieval performance across multiple Turkish benchmarks.

With the rise of Retrieval-Augmented Generation (RAG) systems, the need for reliable, language-specific embeddings has become critical. Turkuaz-Embeddings enhances retrieval reliability and accuracy, making it a valuable component for RAG pipelines and other downstream semantic search tasks involving Turkish content.

Highlights

Outperforms existing Turkish embedding models by up to 20%, and by 9% on average, on Turkish retrieval benchmarks; also outperforms widely used multilingual models on certain Turkish benchmarks.
Achieves up to 35%, and on average 20%, improvement over its baseline architecture, XLM-RoBERTa Large.
Demonstrates robust zero-shot retrieval capabilities, indicating strong generalization.
Specifically designed for Turkish, addressing the unique challenges of an agglutinative language.
Embedding Size: 1024

Model Training Process

Turkuaz-Embeddings is built on the XLM-RoBERTa architecture and was fine-tuned on the MSMARCO-TR dataset. The training pipeline consisted of the following components:

Training Data:
- Document pairs with reranker-generated scores
Reranker for Scoring:
- Jina Reranker v2, a cross-encoder model, was used to generate more nuanced relevance scores for document pairs.
Loss Function:
- Trained with CoSENT (Cosine Sentence) Loss from the Sentence Transformers library, optimizing cosine similarity between sentence embeddings.
Evaluation During Training:
- Used the Embedding Similarity Evaluator from Sentence Transformers to assess embedding quality.
Hardware and Training Setup:
- Trained on 2 million document pairs.
- Parallel distributed training over 4x Nvidia A100 80GB GPUs.
- Learning rate: 5e-5 (initial phase), 5e-6 (later phases).
- Warm-up ratios: 0.1 initially, 0.01 for subsequent phases.
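
The CoSENT objective named above can be illustrated with a small, dependency-free sketch. This is not the training code used for Turkuaz-Embeddings: the function names `cosine` and `cosent_loss`, the toy inputs, and the scale of 20 (the default in the Sentence Transformers `CoSENTLoss`) are assumptions made for illustration. The idea is that whenever one pair has a higher target score (here, a reranker score) than another, its cosine similarity should also be higher, and each violation of that ordering adds to the loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosent_loss(pairs, labels, scale=20.0):
    """Illustrative CoSENT loss.

    pairs  : list of (embedding_a, embedding_b) tuples
    labels : target similarity score for each pair (e.g. reranker scores)
    scale  : multiplier applied to cosine similarities (assumed default: 20)
    """
    sims = [scale * cosine(u, v) for u, v in pairs]
    # The leading 0.0 contributes exp(0) = 1, i.e. the "1 +" inside the log.
    terms = [0.0]
    for i in range(len(sims)):
        for j in range(len(sims)):
            # Pair i is labelled more similar than pair j, so sims[i] should
            # exceed sims[j]; the term grows when that ordering is violated.
            if labels[i] > labels[j]:
                terms.append(sims[j] - sims[i])
    # Numerically stable log-sum-exp over all terms.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Toy example: the first pair is identical, the second pair is orthogonal.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(cosent_loss(toy_pairs, [0.9, 0.1]))  # ordering respected: loss near 0
print(cosent_loss(toy_pairs, [0.1, 0.9]))  # ordering violated: large loss
```

Because the loss depends only on the relative ordering of the target scores, it fits training data where a cross-encoder supplies graded relevance scores rather than binary labels.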

Evaluation Results

📊 Leveraged Models

Model	#Lang	#Param	Base Model	Max. Token	Embed. Size	Latency
Jina Emb.	100	572M	XLM RoBERTa	8192	1024	0.48s
E5 Ins.	100	560M	XLM RoBERTa	512	1024	0.11s
Paraph. MPNET	50	278M	XLM RoBERTa	128	768	0.02s
Distiluse	50	135M	DistilBERT	128	512	0.01s
XLM RoBERTa	100	560M	RoBERTa	512	1024	0.11s
LaBSE	109	471M	BERT	256	768	0.02s
Proposed Model	1	560M	XLM RoBERTa	512	1024	0.11s
BERTurk	1	110M	BERT	512	768	0.03s
BERT NLI TR	1	110M	BERT	512	768	0.01s
Turkish ColBERT	1	111M	BERT	512	768	0.04s

📊 Leveraged Datasets

Dataset	Sample Size	NLP Task
MSMARCO-TR	1M	Question Answering
Wiki-RAG TR	6000	RAG
Neural Bridge TR	12000	RAG
Turkish Historic Question	15500	Question Answering
Google XQuAD – TR	1200	Question Answering
SQuAD-TR	61300	Question Answering

📊 Retrieval Performance Comparison with Turkish Models

Dataset (Sample Size)	Retrieval Model	Precision	Recall
WikiRAG TR (6000)	BERTurk	0.1765	0.1764
	BERT Base NLI TR	0.2149	0.2145
	Proposed Model	0.2580	0.2577
	ColBERT	0.1807	0.1805
Neural Bridge TR (9600)	BERTurk	0.4626	0.4626
	BERT Base NLI TR	0.6014	0.6014
	Proposed Model	0.8286	0.8286
	ColBERT	0.5848	0.5848
TR Historic Question (15.5k)	BERTurk	0.4384	0.4384
	BERT Base NLI TR	0.5121	0.5121
	Proposed Model	0.5873	0.5873
	ColBERT	0.4656	0.4656
Google XQuAD TR (1.2k)	BERTurk	0.6387	0.6387
	BERT Base NLI TR	0.7109	0.7109
	Proposed Model	0.7807	0.7807
	ColBERT	0.7361	0.7361
SQuAD-TR (5000)	BERTurk	0.2834	0.2834
	BERT Base NLI TR	0.3302	0.3302
	Proposed Model	0.4140	0.4140
	ColBERT	0.3334	0.3334
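
As a worked reading of the table above, the sketch below computes the gap in precision between the proposed model and the strongest Turkish baseline on each dataset. The precision values are copied from the table; the helper name `gain_over_best_baseline` is hypothetical, and whether the percentages quoted in the Highlights were derived in exactly this way is an assumption, since the card does not say.

```python
# Precision values copied from the comparison table with Turkish models.
precision = {
    "WikiRAG TR": {
        "BERTurk": 0.1765, "BERT Base NLI TR": 0.2149,
        "Proposed Model": 0.2580, "ColBERT": 0.1807,
    },
    "Neural Bridge TR": {
        "BERTurk": 0.4626, "BERT Base NLI TR": 0.6014,
        "Proposed Model": 0.8286, "ColBERT": 0.5848,
    },
    "TR Historic Question": {
        "BERTurk": 0.4384, "BERT Base NLI TR": 0.5121,
        "Proposed Model": 0.5873, "ColBERT": 0.4656,
    },
    "Google XQuAD TR": {
        "BERTurk": 0.6387, "BERT Base NLI TR": 0.7109,
        "Proposed Model": 0.7807, "ColBERT": 0.7361,
    },
    "SQuAD-TR": {
        "BERTurk": 0.2834, "BERT Base NLI TR": 0.3302,
        "Proposed Model": 0.4140, "ColBERT": 0.3334,
    },
}

def gain_over_best_baseline(scores, target="Proposed Model"):
    """Absolute precision gap between `target` and the best other model."""
    best_baseline = max(v for name, v in scores.items() if name != target)
    return scores[target] - best_baseline

gains = {name: gain_over_best_baseline(s) for name, s in precision.items()}
for name, gain in gains.items():
    print(f"{name}: +{100 * gain:.1f} points")

mean_gain = sum(gains.values()) / len(gains)
print(f"Mean gain: +{100 * mean_gain:.1f} points")
```

With these table values, the mean gap comes to roughly 9.4 percentage points and the largest gap, on Neural Bridge TR, to roughly 22.7 points.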

📊 Retrieval Performance Comparison with Multilingual Models

Dataset (Sample Size)	Retrieval Model	Precision	Recall
MSMARCO TR (2000)	XLM RoBERTa	0.0380	0.0380
	E5 Instruct	0.1531	0.1526
	Jina Embedding	0.1618	0.1618
	Proposed Model	0.1722	0.1722
WikiRAG TR (6000)	E5 Instruct	0.2814	0.2812
	Jina Embedding	0.2694	0.2692
	Proposed Model	0.2580	0.2577
	Paraphrase MPNET	0.2380	0.2377
	Distiluse Base	0.2125	0.2124
	LaBSE	0.1994	0.1990
	XLM RoBERTa	0.1544	0.1542
Neural Bridge TR (9600)	E5 Instruct	0.8085	0.8085
	Jina Embedding	0.7969	0.7969
	Proposed Model	0.8286	0.8286
	Paraphrase MPNET	0.6903	0.6903
	Distiluse Base	0.6179	0.6179
	LaBSE	0.6593	0.6593
	XLM RoBERTa	0.4518	0.4518
TR Historic Question (15.5k)	E5 Instruct	0.6344	0.6344
	Jina Embedding	0.6188	0.6188
	Proposed Model	0.5873	0.5873
	Paraphrase MPNET	0.4739	0.4739
	Distiluse Base	0.4707	0.4707
	LaBSE	0.4667	0.4667
	XLM RoBERTa	0.3639	0.3639
Google XQuAD TR (1.2k)	E5 Instruct	0.8437	0.8437
	Jina Embedding	0.8160	0.8160
	Proposed Model	0.7807	0.7807
	Paraphrase MPNET	0.7571	0.7571
	Distiluse Base	0.7277	0.7277
	LaBSE	0.7538	0.7538
	XLM RoBERTa	0.3975	0.3975
SQuAD-TR (5000)	E5 Instruct	0.5082	0.5082
	Jina Embedding	0.4706	0.4706
	Proposed Model	0.4140	0.4140
	Paraphrase MPNET	0.3820	0.3820
	Distiluse Base	0.3620	0.3620
	LaBSE	0.3594	0.3594
	XLM RoBERTa	0.2078	0.2078

Intended Use

Semantic Search
Retrieval-Augmented Generation (RAG) Systems
Zero-shot Retrieval
Turkish NLP Research and Applications

Sample Usage

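import torch
from sentence_transformers import SentenceTransformer

# Load the model on the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("eneSadi/turkuaz-embeddings", device=device)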

# Encode sentences
sentences = ["Ben bir ceviz ağacıyım Gülhane parkında.", "Herkes gömlek giyerken, Ahmet ceket giyerdi."]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (2, 1024)
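
For the semantic search use case, embeddings such as those returned by `model.encode` are commonly compared with cosine similarity and the documents ranked by score. The sketch below shows only that ranking step, in plain Python, with short made-up vectors standing in for real 1024-dimensional embeddings; the names `cosine` and `rank_by_cosine` are hypothetical helpers, not part of the Sentence Transformers API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs, top_k=None):
    """Return (doc_index, score) tuples sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]

# Toy stand-ins for embeddings; real ones would come from model.encode(...).
query = [1.0, 0.0, 0.0]
documents = [
    [0.9, 0.1, 0.0],  # close to the query direction
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.0],  # in between
]

for index, score in rank_by_cosine(query, documents):
    print(f"doc {index}: {score:.4f}")
```

In a RAG pipeline, the top-ranked documents from a step like this would be passed to the generator as context.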

Keywords

Embedding Models, Information Retrieval, Semantic Search, Retrieval-Augmented Generation (RAG), Turkish NLP

Citation

If you use Turkuaz-Embeddings in your work, please consider citing it appropriately (citation format to be provided later).

License

Apache 2.0

Acknowledgements

This work leveraged the MSMARCO-TR dataset and builds upon the Sentence Transformers library and Jina Reranker v2 model.
