SA-Retrieval-Embeddings-0.2B
Saudi Arabic Retrieval-Optimized Sentence Embeddings
This model is a retrieval-optimized SentenceTransformer, fine-tuned from Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B, and specifically designed for:
- Semantic retrieval
- RAG (Retrieval-Augmented Generation)
- Paragraph-level semantic search
- Chunk-based document retrieval
- Saudi Arabic dialect understanding
Unlike general semantic similarity models, this model is explicitly trained to rank the correct semantic chunk at the top, even among closely related alternatives.
What makes this model different?
Most Arabic embedding models are trained on pairwise similarity only.
This model goes further by incorporating:
- Summary → Chunk retrieval supervision
- Hard negatives from semantic chunk boundaries
- Triplet-based discrimination
- In-batch negatives via MNRL (MultipleNegativesRankingLoss)
As a result, it excels in real-world retrieval scenarios, not just sentence similarity. A minimal sketch of this training setup follows below.
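To make the objective concrete, here is a minimal sketch of how such a two-loss setup can be wired together with sentence-transformers. The toy examples, batch sizes, and the round-robin mixing of the two objectives are assumptions for illustration, not the released training script.

```python
# Hypothetical sketch: MultipleNegativesRankingLoss (in-batch negatives)
# combined with TripletLoss (explicit boundary hard negatives).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B")

# (summary, correct chunk) pairs -> other in-batch chunks act as negatives
mnrl_examples = [
    InputExample(texts=["summary of a section", "the matching chunk"]),
]
# (summary, correct chunk, hard negative from a neighboring chunk boundary)
triplet_examples = [
    InputExample(texts=["summary of a section", "the matching chunk",
                        "an adjacent chunk from the same document"]),
]

mnrl_loader = DataLoader(mnrl_examples, shuffle=True, batch_size=32)
triplet_loader = DataLoader(triplet_examples, shuffle=True, batch_size=32)

mnrl_loss = losses.MultipleNegativesRankingLoss(model)
triplet_loss = losses.TripletLoss(model)

# Alternate between the two objectives; the exact weighting is an assumption.
model.fit(
    train_objectives=[(mnrl_loader, mnrl_loss), (triplet_loader, triplet_loss)],
    epochs=1,
    use_amp=True,  # FP16 training, matching the precision stated below
)
```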
Training Overview
- Base Model: SA-STS-Embeddings-0.2B (itself built on UBC-NLP/MARBERTv2)
- Training Objectives:
  - MultipleNegativesRankingLoss (primary)
  - TripletLoss with hard negatives (boundary-based)
- Embedding Dimension: 768
- Pooling Strategy: Mean pooling
- Max Sequence Length: 512 tokens
- Training Samples: 4,038+ supervised retrieval examples
- Precision: FP16
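The dimension, sequence length, and pooling module can be verified directly once the model is loaded; a quick sanity check:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-Retrieval-Embeddings-0.2B")

print(model.get_sentence_embedding_dimension())  # expected: 768
print(model.max_seq_length)                      # expected: 512
print(model[1])                                  # pooling module (mean pooling)
```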
Training Data
The model was trained using Saudi Semantic Chunking data, where:
- Each document is split into 3–5 semantic chunks
- Each chunk has a human-written summary
- Retrieval task: given a summary, retrieve the correct chunk among the other chunks from the same document
Dataset: Omartificial-Intelligence-Space/Saudi-Semantic-Chunks
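To illustrate this supervision signal, the sketch below builds (summary, positive chunk, hard negative) triples from the dataset. The field names `chunks` and `summaries` are assumptions about the dataset schema, not verified column names.

```python
# Hypothetical sketch: derive retrieval triples from semantically chunked documents.
from datasets import load_dataset

ds = load_dataset("Omartificial-Intelligence-Space/Saudi-Semantic-Chunks", split="train")

triples = []
for record in ds:
    chunks = record["chunks"]        # assumed field name
    summaries = record["summaries"]  # assumed field name
    for i, summary in enumerate(summaries):
        positive = chunks[i]
        # every other chunk of the same document serves as a boundary hard negative
        for j, negative in enumerate(chunks):
            if j != i:
                triples.append((summary, positive, negative))
```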
Evaluation Results
The model was evaluated on a hard retrieval benchmark consisting of
1,515 retrieval cases across 24 Saudi domains, using chunk-level negatives.
Leaderboard Comparison
Key Takeaways
- Best Top-1 Accuracy → correct chunk ranked first ~88% of the time
- Best MRR → correct chunk appears very early in the ranking
- Excellent Recall@5 (99.2%) → ideal for RAG pipelines
- Highest FinalScore → best overall balance of retrieval + discourse awareness
Metric Definitions
- Top-1: Correct chunk ranked first
- MRR: Mean Reciprocal Rank
- Recall@k: Correct chunk appears in top-k
- nDCG: Ranking quality with position discount
- Contrast: Intra-chunk similarity − Inter-chunk similarity
- FinalScore: 0.4 × Top-1 + 0.3 × MRR + 0.2 × Contrast + 0.1 × nDCG
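As a reading aid, here is a minimal sketch of these metrics for a single query over its candidate chunks. The per-query Contrast term (gold similarity minus the mean similarity of the other chunks) is an assumed simplification of the definition above.

```python
import numpy as np

def retrieval_metrics(scores: np.ndarray, correct: int, k: int = 5) -> dict:
    """scores: similarity of one query to every candidate chunk; correct: gold index."""
    order = np.argsort(-scores)                       # candidates ranked by similarity
    rank = int(np.where(order == correct)[0][0]) + 1  # 1-based rank of the gold chunk

    top1 = 1.0 if rank == 1 else 0.0
    mrr = 1.0 / rank
    recall_k = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / np.log2(rank + 1)  # one relevant item -> nDCG = 1 / log2(1 + rank)
    others = np.delete(scores, correct)
    contrast = float(scores[correct] - others.mean())  # assumed per-query Contrast

    final = 0.4 * top1 + 0.3 * mrr + 0.2 * contrast + 0.1 * ndcg
    return {"top1": top1, "mrr": mrr, f"recall@{k}": recall_k,
            "ndcg": ndcg, "contrast": contrast, "final_score": final}
```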
Usage
Install

```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/SA-Retrieval-Embeddings-0.2B"
)

sentences = [
    "أفضل وقت لزيارة العلا في الشتاء",        # "The best time to visit AlUla is in winter"
    "العلا تكون أجمل في الشتاء والجو معتدل",   # "AlUla is most beautiful in winter, and the weather is mild"
    "زحمة الرياض اليوم غير طبيعية",            # "Riyadh's traffic today is unbelievable"
]

embeddings = model.encode(sentences, normalize_embeddings=True)
```
```python
from sklearn.metrics.pairwise import cosine_similarity

query = "أفضل وقت لزيارة أبها"  # "The best time to visit Abha"

chunks = [
    "أبها تتميز بأجواء معتدلة في الصيف.",  # "Abha is known for its mild summer weather."
    "الرياض مدينة مزدحمة.",               # "Riyadh is a crowded city."
    "مطاعم جدة متنوعة.",                  # "Jeddah's restaurants are diverse."
]

q_emb = model.encode(query, normalize_embeddings=True)
c_embs = model.encode(chunks, normalize_embeddings=True)

scores = cosine_similarity([q_emb], c_embs)[0]

# Print candidates from most to least similar
for s, c in sorted(zip(scores, chunks), reverse=True):
    print(round(s, 3), c)
```
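For larger corpora, the library's built-in top-k search avoids building the full similarity matrix by hand; a sketch using the same query and chunks as above:

```python
from sentence_transformers import util

# semantic_search returns, per query, a list of {"corpus_id": ..., "score": ...} hits
hits = util.semantic_search(q_emb, c_embs, top_k=3)[0]
for hit in hits:
    print(round(hit["score"], 3), chunks[hit["corpus_id"]])
```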
Intended Use
- RAG systems
- Semantic search engines
- Knowledge base retrieval
- Document chunk retrieval
- Saudi dialect applications
- Government & enterprise search
Limitations
- Optimized for Saudi Arabic (dialect + MSA)
- Not trained for cross-lingual retrieval
- Not intended for generative tasks
- Best performance when text is chunked semantically
Citation

```bibtex
@misc{sa_retrieval_embeddings_2025,
  title     = {SA-Retrieval-Embeddings-0.2B: Retrieval-Optimized Saudi Arabic Sentence Embeddings},
  author    = {Omer Nacar},
  year      = {2025},
  publisher = {HuggingFace}
}
```