🎯 Model Overview
This model enhances AraBERTv02 with Matryoshka Representation Learning and LoRA adaptation to produce Arabic sentence embeddings. It supports multiple embedding dimensions (8, 64, 128, 256) from a single model, letting you trade off accuracy against speed and storage.
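The core Matryoshka idea is that the first *k* components of the full embedding already form a usable embedding on their own, so the smaller dimensions come from truncation rather than from separate models. A minimal sketch of that truncation step in plain PyTorch (independent of this repo's wrapper API):

```python
import torch
import torch.nn.functional as F

def truncate_embedding(full_emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components of a Matryoshka embedding and re-normalize."""
    truncated = full_emb[..., :dim]
    return F.normalize(truncated, p=2, dim=-1)

full = torch.randn(256)  # stand-in for a 256-dim sentence embedding
small = truncate_embedding(full, 64)
print(small.shape)  # torch.Size([64])
```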
✨ Key Features
| Feature | Description |
|---|---|
| 🔄 Multi-Dimensional | Single model supports 4 different embedding sizes (8, 64, 128, 256) |
| 🚀 High Performance | Outperforms base AraBERT across all dimensions |
| 📊 Arabic NLI Optimized | Trained specifically on Arabic Natural Language Inference |
| ⚡ Efficient Inference | Smaller dimensions for faster processing |
| 🎯 Triplet Loss Training | Enhanced semantic understanding through triplet learning |
🛠️ Quick Start
Installation
```bash
pip install git+https://github.com/Abdalrahman54/matryoshka-wrapper.git
```
Basic Usage
```python
from matryoshka_wrapper import load_model
from torch.nn.functional import cosine_similarity

# Load the model at the desired embedding dimension
repo_name = "Abdalrahmankamel/matryoshka-arabert"
model, tokenizer = load_model(repo_name, dim="256")

# Example texts
text1 = "هذا المنتج كان مخيبًا للآمال."  # "This product was disappointing."
text2 = "هذه البضاعة رائعة!"  # "These goods are great!"

# Generate embeddings
emb1 = model.get_embedding(text1, tokenizer, dim="256").squeeze()
emb2 = model.get_embedding(text2, tokenizer, dim="256").squeeze()

# Calculate cosine similarity between the two sentence embeddings
similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
print(f"🔍 Cosine Similarity: {similarity:.4f}")
```
📊 More Examples
Triplet Example
```python
# Triplet data example
anchor = "الطفل يلعب في الحديقة"  # "The child is playing in the garden"
positive = "ولد صغير يلهو في البستان"  # "A young boy is playing in the orchard" (similar sentence)
negative = "السيارة تسير في الشارع"  # "The car is driving down the street" (different sentence)

# Generate embeddings
anchor_emb = model.get_embedding(anchor, tokenizer, dim="256").squeeze()
positive_emb = model.get_embedding(positive, tokenizer, dim="256").squeeze()
negative_emb = model.get_embedding(negative, tokenizer, dim="256").squeeze()

# Calculate similarities
sim_positive = cosine_similarity(anchor_emb.unsqueeze(0), positive_emb.unsqueeze(0)).item()
sim_negative = cosine_similarity(anchor_emb.unsqueeze(0), negative_emb.unsqueeze(0)).item()

print("🔍 Triplet Results:")
print(f"📊 Anchor ↔ Positive: {sim_positive:.4f}")
print(f"📊 Anchor ↔ Negative: {sim_negative:.4f}")
print(f"📈 Margin: {sim_positive - sim_negative:.4f}")

if sim_positive > sim_negative:
    print("✅ Triplet Success!")
else:
    print("❌ Triplet Failed!")
```
Dimension Comparison
```python
# Compare similarity across all supported dimensions
text1 = "اطفال يمرحون سوياً بالكرة في المساحات الخضراء"  # "Children having fun together with a ball in the green spaces"
text2 = "أطفال يلعبون كرة القدم على العشب"  # "Children playing football on the grass"

dimensions = [8, 64, 128, 256]
print("🔍 Multi-Dimensional Similarity Comparison")
print("=" * 50)

for dim in dimensions:
    model, tokenizer = load_model(repo_name, dim=str(dim))
    emb1 = model.get_embedding(text1, tokenizer, dim=str(dim)).squeeze()
    emb2 = model.get_embedding(text2, tokenizer, dim=str(dim)).squeeze()
    similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
    print(f"📐 Dim {dim:>3}: Similarity = {similarity:.4f} | Shape = {emb1.shape}")
```
🎯 Use Cases
✅ Recommended Uses
- Semantic Search: Find similar Arabic documents or passages (see the sketch after this list)
- Information Retrieval: Enhance search systems with semantic understanding
- Content Recommendation: Suggest related Arabic content
- Document Clustering: Group similar Arabic texts
- Question Answering: Retrieve relevant contexts for Arabic QA
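A minimal semantic-search sketch built on the same `load_model` / `get_embedding` API shown in Quick Start; the mini-corpus and query below are illustrative, not from the repo:

```python
import torch
from torch.nn.functional import cosine_similarity
from matryoshka_wrapper import load_model

model, tokenizer = load_model("Abdalrahmankamel/matryoshka-arabert", dim="256")

# Illustrative mini-corpus; in practice this would be your document collection
corpus = [
    "الطقس اليوم مشمس وجميل",  # "The weather today is sunny and beautiful"
    "أحب قراءة الكتب العلمية",  # "I love reading science books"
    "كرة القدم رياضة شعبية",  # "Football is a popular sport"
]
query = "ما هي حالة الجو؟"  # "What is the weather like?"

# Embed the corpus once, then embed each incoming query
corpus_embs = torch.stack(
    [model.get_embedding(t, tokenizer, dim="256").squeeze() for t in corpus]
)
query_emb = model.get_embedding(query, tokenizer, dim="256").squeeze()

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_emb.unsqueeze(0), corpus_embs)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```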
❌ Limitations
- Non-Arabic Text: Optimized specifically for Arabic language
- Classification Tasks: Not directly trained for classification
- Generative Tasks: Not suitable for text generation or translation
- Dialectal Variations: May underperform on specific Arabic dialects
🏗️ Model Architecture
| Component | Details |
|---|---|
| Base Model | aubmindlab/bert-base-arabertv02 |
| Enhancement | LoRA (Low-Rank Adaptation) |
| Training Objective | Triplet Loss |
| Embedding Dimensions | 8, 64, 128, 256 |
| Language | Arabic |
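For readers who want to reproduce the adaptation step, here is a hedged sketch of attaching LoRA adapters to the base encoder with the `peft` library. The rank, alpha, dropout, and target modules below are illustrative assumptions, not the repo's recorded settings:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

# Hypothetical LoRA settings; the actual training configuration may differ
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```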
🔬 Training Details
Training Configuration
| Parameter | Value |
|---|---|
| Loss Function | Triplet Loss |
| Adaptation Method | LoRA |
| Precision | FP16 Mixed Precision |
| Epochs | 3 |
| Batch Size | 32 |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 |
| Training Time | ~7 hours |
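The triplet objective pushes the anchor closer to the positive than to the negative by at least a margin: loss = max(0, d(a, p) - d(a, n) + m). A minimal sketch with PyTorch's built-in loss; the margin value here is an assumption, not the documented training setting:

```python
import torch
import torch.nn as nn

# Margin of 1.0 is illustrative; the value used in training is not documented here
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

anchor = torch.randn(32, 256, requires_grad=True)    # batch of anchor embeddings
positive = torch.randn(32, 256, requires_grad=True)  # batch of positive embeddings
negative = torch.randn(32, 256, requires_grad=True)  # batch of negative embeddings

loss = triplet_loss(anchor, positive, negative)
loss.backward()
print(loss.item())
```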
Dataset
- Source: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
- Type: Arabic Natural Language Inference triplets
- Structure: Anchor-Positive-Negative sentence pairs
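A hedged sketch of loading the triplets with the `datasets` library; the split and column names are assumptions based on the anchor-positive-negative structure described above:

```python
from datasets import load_dataset

# Split and column names assumed; check the dataset card for the actual layout
ds = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet", split="train")

example = ds[0]
print(example["anchor"], example["positive"], example["negative"], sep="\n")
```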
📈 Evaluation Metrics
- Cosine Similarity: Primary similarity measure
- Mean Average Precision (MAP): Retrieval performance
- Recall@K: Top-K retrieval accuracy
- Triplet Accuracy: Correct triplet ranking percentage
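Triplet accuracy is the fraction of triplets in which the anchor is closer to the positive than to the negative. A minimal sketch over batches of embeddings:

```python
import torch
from torch.nn.functional import cosine_similarity

def triplet_accuracy(anchor: torch.Tensor, positive: torch.Tensor,
                     negative: torch.Tensor) -> float:
    """Fraction of triplets ranked correctly (anchor closer to positive than negative)."""
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return (sim_pos > sim_neg).float().mean().item()

# Toy batch of 32 random 256-dim embeddings, just to show the call
a, p, n = (torch.randn(32, 256) for _ in range(3))
print(f"Triplet accuracy: {triplet_accuracy(a, p, n):.2%}")
```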
⚠️ Bias and Limitations
Potential Biases
- Inherits biases from AraBERT base model and training datasets
- May favor Modern Standard Arabic over dialectal variants
- Performance varies across different Arabic domains and regions
Recommendations
- Evaluate on target domain before deployment
- Use human oversight for critical applications
- Test with diverse Arabic text sources
🌍 Environmental Impact
- Hardware: NVIDIA A100 GPU
- Training Duration: ~7 hours
- Carbon Footprint: Calculated using ML CO2 Impact Calculator
📚 Citation
If you use this model in your research, please cite:
```bibtex
@misc{kamel2025arabert-matryoshka,
  author = {Abdalrahman Kamel},
  title  = {AraBERT Matryoshka: Multi-Dimensional Arabic Sentence Embeddings with Triplet Loss},
  year   = {2025},
  url    = {https://huggingface.co/Abdalrahmankamel/matryoshka-arabert},
  note   = {Hugging Face Model Repository}
}
```
🙏 Acknowledgments
- AraBERT Team: For the excellent base model (aubmindlab/bert-base-arabertv02)
- Sentence Transformers: For the robust training framework
- Matryoshka Representation Learning: For the innovative nested embedding approach
- Arabic NLI Dataset: Omartificial-Intelligence-Space for the training data
📄 License
This model is released under the Apache 2.0 License.
Developed by Abdalrahman Kamel
Advancing Arabic NLP through innovative embedding techniques