🌟 AraBERT Matryoshka Embeddings

High-Quality Arabic Sentence Embeddings with Flexible Dimensions



🎯 Model Overview

This model enhances AraBERTv02 with Matryoshka Representation Learning and LoRA adaptation to produce high-quality Arabic sentence embeddings. A single model supports four embedding dimensions (8, 64, 128, 256), letting you trade representation quality against inference speed and storage.
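
The Matryoshka property means the smaller embeddings are nested prefixes of the full vector, which is how one model can serve several sizes. Below is a minimal, generic PyTorch sketch of that idea, independent of this repo's wrapper API; full_emb is a random stand-in, not a real embedding:

import torch
import torch.nn.functional as F

# Stand-in for a full 256-d sentence embedding from the model.
full_emb = torch.randn(256)

# Matryoshka idea: take a prefix of the vector and re-normalize it.
for dim in (8, 64, 128, 256):
    sub_emb = F.normalize(full_emb[:dim], dim=0)
    print(f"dim={dim:>3} -> shape {tuple(sub_emb.shape)}")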

✨ Key Features

  • 🔄 Multi-Dimensional: A single model supports four embedding sizes (8, 64, 128, 256)
  • 🚀 High Performance: Outperforms base AraBERT across all dimensions
  • 📊 Arabic NLI Optimized: Trained specifically on Arabic Natural Language Inference
  • ⚡ Efficient Inference: Smaller dimensions allow faster processing
  • 🎯 Triplet Loss Training: Enhanced semantic understanding through triplet learning

🛠️ Quick Start

Installation

pip install git+https://github.com/Abdalrahman54/matryoshka-wrapper.git

Basic Usage

from matryoshka_wrapper import load_model
from torch.nn.functional import cosine_similarity

# Load model with desired dimension
repo_name = "Abdalrahmankamel/matryoshka-arabert"
model, tokenizer = load_model(repo_name, dim="256")

# Example texts
text1 = "هذا المنتج كان مخيبًا للآمال."  # "This product was disappointing."
text2 = "هذه البضاعة رائعة!"             # "This merchandise is wonderful!"

# Generate embeddings
emb1 = model.get_embedding(text1, tokenizer, dim="256").squeeze()
emb2 = model.get_embedding(text2, tokenizer, dim="256").squeeze()

# Calculate similarity
similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
print(f"🔍 Cosine Similarity: {similarity:.4f}")

📊 Performance Comparison

Triplet Example

# Triplet data example
anchor = "الطفل يلعب في الحديقة"        # "The child is playing in the garden"
positive = "ولد صغير يلهو في البستان"    # "A young boy is playing in the orchard" (similar)
negative = "السيارة تسير في الشارع"      # "The car is driving down the street" (different)

# Generate embeddings
anchor_emb = model.get_embedding(anchor, tokenizer, dim="256").squeeze()
positive_emb = model.get_embedding(positive, tokenizer, dim="256").squeeze()
negative_emb = model.get_embedding(negative, tokenizer, dim="256").squeeze()

# Calculate similarities
sim_positive = cosine_similarity(anchor_emb.unsqueeze(0), positive_emb.unsqueeze(0)).item()
sim_negative = cosine_similarity(anchor_emb.unsqueeze(0), negative_emb.unsqueeze(0)).item()

print("🔍 Triplet Results:")
print(f"📊 Anchor ↔ Positive: {sim_positive:.4f}")
print(f"📊 Anchor ↔ Negative: {sim_negative:.4f}")
print(f"📈 Margin: {sim_positive - sim_negative:.4f}")

if sim_positive > sim_negative:
    print("✅ Triplet Success!")
else:
    print("❌ Triplet Failed!")

Dimension Comparison

# Compare across all dimensions
text1 = "اطفال يمرحون سوياً بالكرة في المساحات الخضراء"  # "Children having fun together with a ball in the green spaces"
text2 = "أطفال يلعبون كرة القدم على العشب"              # "Children playing football on the grass"

dimensions = [8, 64, 128, 256]

print("🔍 Multi-Dimensional Similarity Comparison")
print("=" * 50)

for dim in dimensions:
    model, tokenizer = load_model(repo_name, dim=str(dim))
    
    emb1 = model.get_embedding(text1, tokenizer, dim=str(dim)).squeeze()
    emb2 = model.get_embedding(text2, tokenizer, dim=str(dim)).squeeze()
    
    similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
    
    print(f"📐 Dim {dim:>3}: Similarity = {similarity:.4f} | Shape = {emb1.shape}")

🎯 Use Cases

✅ Recommended Uses

  • Semantic Search: Find similar Arabic documents or passages (see the sketch after this list)
  • Information Retrieval: Enhance search systems with semantic understanding
  • Content Recommendation: Suggest related Arabic content
  • Document Clustering: Group similar Arabic texts
  • Question Answering: Retrieve relevant contexts for Arabic QA
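
To make the semantic-search use case concrete, here is a minimal sketch that ranks a small corpus against a query. It reuses the model and tokenizer loaded in Quick Start; the documents and query are illustrative examples:

import torch
from torch.nn.functional import cosine_similarity

docs = [
    "الطقس اليوم مشمس وجميل",           # "The weather today is sunny and nice"
    "أسعار النفط ترتفع عالمياً",          # "Oil prices are rising globally"
    "مباراة كرة القدم انتهت بالتعادل",    # "The football match ended in a draw"
]
query = "كيف حالة الجو اليوم؟"           # "What is the weather like today?"

# Embed the corpus once, then embed the query.
doc_embs = torch.stack([model.get_embedding(d, tokenizer, dim="256").squeeze() for d in docs])
query_emb = model.get_embedding(query, tokenizer, dim="256").squeeze()

# One cosine score per document; print best matches first.
scores = cosine_similarity(query_emb.unsqueeze(0), doc_embs)
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.4f}  {doc}")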

❌ Limitations

  • Non-Arabic Text: Optimized specifically for the Arabic language
  • Classification Tasks: Not directly trained for classification
  • Generative Tasks: Not suitable for text generation or translation
  • Dialectal Variations: May underperform on specific Arabic dialects

🏗️ Model Architecture

  • Base Model: aubmindlab/bert-base-arabertv02
  • Enhancement: LoRA (Low-Rank Adaptation)
  • Training Objective: Triplet Loss
  • Embedding Dimensions: 8, 64, 128, 256
  • Language: Arabic
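
For readers unfamiliar with LoRA: it freezes the base weights and trains small low-rank matrices injected into selected layers. A hedged sketch of attaching such an adapter with the peft library; the rank, alpha, and target modules below are illustrative assumptions, not this model's actual training settings:

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

# Illustrative LoRA config; r / lora_alpha / target_modules are
# assumptions for demonstration, not the trained hyperparameters.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query", "value"],  # BERT attention projections
    lora_dropout=0.1,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the low-rank adapters train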

🔬 Training Details

Training Configuration

  • Loss Function: Triplet Loss
  • Adaptation Method: LoRA
  • Precision: FP16 mixed precision
  • Epochs: 3
  • Batch Size: 32
  • Optimizer: AdamW
  • Hardware: NVIDIA A100
  • Training Time: ~7 hours
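
As a reference for the objective, a minimal sketch of triplet loss using PyTorch's built-in nn.TripletMarginLoss; the margin value and random tensors are stand-ins for illustration, not the trained configuration:

import torch
import torch.nn as nn

# Triplet objective: pull the anchor toward the positive and push it
# away from the negative by at least `margin`. The margin (1.0) is an
# assumed value, not this model's actual setting.
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(32, 256, requires_grad=True)    # batch of anchor embeddings
positive = torch.randn(32, 256, requires_grad=True)  # similar sentences
negative = torch.randn(32, 256, requires_grad=True)  # dissimilar sentences

loss = triplet_loss(anchor, positive, negative)
loss.backward()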

Dataset

The model was trained on Arabic Natural Language Inference (NLI) data arranged as (anchor, positive, negative) triplets for the triplet-loss objective.

📈 Evaluation Metrics

  • Cosine Similarity: Primary similarity measure
  • Mean Average Precision (MAP): Retrieval performance
  • Recall@K: Top-K retrieval accuracy
  • Triplet Accuracy: Correct triplet ranking percentage
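
As an illustration of the last metric, a hedged sketch that computes triplet accuracy over (anchor, positive, negative) embedding triples; the triplets here are random stand-ins for real evaluation data:

import torch
from torch.nn.functional import cosine_similarity

def triplet_accuracy(triplets):
    """Fraction of triplets where the anchor is closer to the positive
    than to the negative, measured by cosine similarity."""
    correct = 0
    for anchor, positive, negative in triplets:
        sim_pos = cosine_similarity(anchor.unsqueeze(0), positive.unsqueeze(0)).item()
        sim_neg = cosine_similarity(anchor.unsqueeze(0), negative.unsqueeze(0)).item()
        correct += sim_pos > sim_neg
    return correct / len(triplets)

# Hypothetical data: random embeddings stand in for real triples.
triplets = [tuple(torch.randn(3, 256)) for _ in range(100)]
print(f"Triplet accuracy: {triplet_accuracy(triplets):.2%}")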

⚠️ Bias and Limitations

Potential Biases

  • Inherits biases from AraBERT base model and training datasets
  • May favor Modern Standard Arabic over dialectal variants
  • Performance varies across different Arabic domains and regions

Recommendations

  • Evaluate on target domain before deployment
  • Use human oversight for critical applications
  • Test with diverse Arabic text sources

🌍 Environmental Impact

Training required approximately 7 hours on NVIDIA A100 hardware (see Training Configuration above).

📚 Citation

If you use this model in your research, please cite:

@misc{kamel2025arabert-matryoshka,
  author = {Abdalrahman Kamel},
  title = {AraBERT Matryoshka: Multi-Dimensional Arabic Sentence Embeddings with Triplet Loss},
  year = {2025},
  url = {https://huggingface.co/Abdalrahmankamel/matryoshka-arabert},
  note = {Hugging Face Model Repository}
}

🙏 Acknowledgments

This model builds on aubmindlab/bert-base-arabertv02; thanks to the AUB MIND Lab for releasing the AraBERT base model.

📄 License

This model is released under the Apache 2.0 License.


Developed by Abdalrahman Kamel

Advancing Arabic NLP through innovative embedding techniques
