🌟 AraBERT Matryoshka Embeddings

High-Quality Arabic Sentence Embeddings with Flexible Dimensions



🎯 Model Overview

This model enhances AraBERTv02 with Matryoshka Representation Learning and LoRA adaptation to produce high-quality Arabic sentence embeddings. A single model supports multiple embedding dimensions (8, 64, 128, 256), letting you trade accuracy against speed and storage.
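
Matryoshka Representation Learning trains the network so that every prefix of the full embedding is itself a usable embedding: to obtain a k-dimensional vector you keep the first k components and re-normalize. Below is a minimal sketch of that truncation step, assuming MRL-style prefix training; the wrapper's load_model/get_embedding API shown later handles this for you:

import torch
import torch.nn.functional as F

def truncate_embedding(full_emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components of a Matryoshka embedding and re-normalize.

    Only valid because MRL training makes embedding prefixes meaningful.
    """
    sub = full_emb[..., :dim]             # leading `dim` dimensions
    return F.normalize(sub, p=2, dim=-1)  # rescale to unit length

# A 256-d embedding shrinks to any of the supported sizes
full = torch.randn(256)
for d in (8, 64, 128, 256):
    print(d, truncate_embedding(full, d).shape)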

✨ Key Features

Feature | Description
🔄 Multi-Dimensional | A single model supports 4 embedding sizes (8, 64, 128, 256)
🚀 High Performance | Outperforms base AraBERT across all dimensions
⚡ Efficient Inference | Smaller dimensions allow faster processing
📊 Arabic NLI Optimized | Trained specifically on Arabic Natural Language Inference
🎯 Triplet Loss Training | Enhanced semantic understanding through triplet learning

🛠️ Quick Start

Installation

pip install git+https://github.com/Abdalrahman54/matryoshka-wrapper.git

Basic Usage

from matryoshka_wrapper import load_model
from torch.nn.functional import cosine_similarity

# Load model with desired dimension
repo_name = "Abdalrahmankamel/matryoshka-arabert"
model, tokenizer = load_model(repo_name, dim="256")

# Example texts
text1 = "هذا المنتج كان مخيبًا للآمال."
text2 = "هذه البضاعة رائعة!"

# Generate embeddings
emb1 = model.get_embedding(text1, tokenizer, dim="256").squeeze()
emb2 = model.get_embedding(text2, tokenizer, dim="256").squeeze()

# Calculate similarity
similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
print(f"🔍 Cosine Similarity: {similarity:.4f}")

📊 Performance Comparison

Triplet Example

# Triplet data example
anchor = "الطفل يلعب في الحديقة"
positive = "ولد صغير يلهو في البستان"      # Similar sentence
negative = "السيارة تسير في الشارع"         # Different sentence

# Generate embeddings
anchor_emb = model.get_embedding(anchor, tokenizer, dim="256").squeeze()
positive_emb = model.get_embedding(positive, tokenizer, dim="256").squeeze()
negative_emb = model.get_embedding(negative, tokenizer, dim="256").squeeze()

# Calculate similarities
sim_positive = cosine_similarity(anchor_emb.unsqueeze(0), positive_emb.unsqueeze(0)).item()
sim_negative = cosine_similarity(anchor_emb.unsqueeze(0), negative_emb.unsqueeze(0)).item()

print("🔍 Triplet Results:")
print(f"📊 Anchor ↔ Positive: {sim_positive:.4f}")
print(f"📊 Anchor ↔ Negative: {sim_negative:.4f}")
print(f"📈 Margin: {sim_positive - sim_negative:.4f}")

if sim_positive > sim_negative:
    print("✅ Triplet Success!")
else:
    print("❌ Triplet Failed!")

Dimension Comparison

# Compare across all dimensions
text1 = "اطفال يمرحون سوياً بالكرة في المساحات الخضراء"
text2 = "أطفال يلعبون كرة القدم على العشب"

dimensions = [8, 64, 128, 256]

print("🔍 Multi-Dimensional Similarity Comparison")
print("=" * 50)

for dim in dimensions:
    model, tokenizer = load_model(repo_name, dim=str(dim))
    
    emb1 = model.get_embedding(text1, tokenizer, dim=str(dim)).squeeze()
    emb2 = model.get_embedding(text2, tokenizer, dim=str(dim)).squeeze()
    
    similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
    
    print(f"📐 Dim {dim:>3}: Similarity = {similarity:.4f} | Shape = {emb1.shape}")

🎯 Use Cases

✅ Recommended Uses

  • Semantic Search: Find similar Arabic documents or passages (see the sketch after this list)
  • Information Retrieval: Enhance search systems with semantic understanding
  • Content Recommendation: Suggest related Arabic content
  • Document Clustering: Group similar Arabic texts
  • Question Answering: Retrieve relevant contexts for Arabic QA
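
As a sketch of the semantic-search use case, the snippet below embeds a small corpus and ranks it against a query by cosine similarity. The corpus and query strings are illustrative; the model is loaded exactly as in Quick Start:

import torch
from torch.nn.functional import cosine_similarity
from matryoshka_wrapper import load_model

model, tokenizer = load_model("Abdalrahmankamel/matryoshka-arabert", dim="256")

corpus = [
    "الطفل يلعب في الحديقة",              # "The child is playing in the garden"
    "السيارة تسير في الشارع",             # "The car is driving down the street"
    "أطفال يلعبون كرة القدم على العشب",   # "Children playing football on the grass"
]
query = "ولد صغير يلهو في البستان"        # "A young boy is playing in the orchard"

# Embed the corpus once, then the query
corpus_emb = torch.stack(
    [model.get_embedding(t, tokenizer, dim="256").squeeze() for t in corpus]
)
query_emb = model.get_embedding(query, tokenizer, dim="256").squeeze().unsqueeze(0)

# Rank documents by similarity to the query (broadcasts over the corpus)
scores = cosine_similarity(query_emb, corpus_emb)
for i in scores.argsort(descending=True).tolist():
    print(f"{scores[i]:.4f}  {corpus[i]}")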

❌ Limitations

  • Non-Arabic Text: Optimized specifically for Arabic language
  • Classification Tasks: Not directly trained for classification
  • Generative Tasks: Not suitable for text generation or translation
  • Dialectal Variations: May underperform on specific Arabic dialects

🏗️ Model Architecture

Component | Details
Base Model | aubmindlab/bert-base-arabertv02
Enhancement | LoRA (Low-Rank Adaptation)
Training Objective | Triplet Loss
Embedding Dimensions | 8, 64, 128, 256
Language | Arabic
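
LoRA freezes the base model and injects small trainable low-rank matrices into selected layers. The exact ranks and target modules used for this model are not published in this card, so the values below are illustrative assumptions; the sketch only shows how such an adapter attaches to AraBERT with the peft library:

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

# Illustrative settings; the actual rank/targets are assumptions
lora_cfg = LoraConfig(
    r=16,                               # low-rank dimension
    lora_alpha=32,                      # scaling factor
    target_modules=["query", "value"],  # BERT attention projections
    lora_dropout=0.1,
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only adapter weights are trainable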

🔬 Training Details

Training Configuration

Parameter | Value
Loss Function | Triplet Loss
Adaptation Method | LoRA
Precision | FP16 Mixed Precision
Epochs | 3
Batch Size | 32
Optimizer | AdamW
Hardware | NVIDIA A100
Training Time | ~7 hours
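
The triplet objective pushes the anchor closer to the positive than to the negative by at least a margin. Here is a minimal sketch of the loss, assuming cosine distance and an illustrative margin (the card specifies neither):

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.5):
    """L = mean(max(0, d(a, p) - d(a, n) + margin)) with d = 1 - cosine."""
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=-1)
    d_neg = 1 - F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy batch of 256-d embeddings
a, p, n = (torch.randn(4, 256) for _ in range(3))
print(triplet_loss(a, p, n))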

Dataset

Arabic Natural Language Inference (NLI) data, arranged as (anchor, positive, negative) triplets for the triplet-loss objective.

📈 Evaluation Metrics

  • Cosine Similarity: Primary similarity measure
  • Mean Average Precision (MAP): Retrieval performance
  • Recall@K: Top-K retrieval accuracy (see the sketch after this list)
  • Triplet Accuracy: Correct triplet ranking percentage
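
One simple definition of Recall@K, assuming a single relevant document per query (an illustrative simplification):

import torch
import torch.nn.functional as F

def recall_at_k(query_emb, doc_emb, relevant, k=5):
    """Fraction of queries whose relevant doc lands in the top-k results.

    query_emb: [Q, D], doc_emb: [N, D], relevant[i]: index of query i's correct doc.
    """
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(doc_emb, dim=-1).T  # [Q, N]
    topk = sims.topk(k, dim=-1).indices                                     # [Q, k]
    hits = (topk == torch.tensor(relevant).unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Toy example: 3 queries, 10 docs, one relevant doc per query
q, d = torch.randn(3, 256), torch.randn(10, 256)
print(recall_at_k(q, d, relevant=[0, 4, 7], k=5))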

⚠️ Bias and Limitations

Potential Biases

  • Inherits biases from AraBERT base model and training datasets
  • May favor Modern Standard Arabic over dialectal variants
  • Performance varies across different Arabic domains and regions

Recommendations

  • Evaluate on target domain before deployment
  • Use human oversight for critical applications
  • Test with diverse Arabic text sources

🌍 Environmental Impact

Training ran for approximately 7 hours on an NVIDIA A100 GPU (see Training Configuration above).

📚 Citation

If you use this model in your research, please cite:

@misc{kamel2025arabert-matryoshka,
  author = {Abdalrahman Kamel},
  title = {AraBERT Matryoshka: Multi-Dimensional Arabic Sentence Embeddings with Triplet Loss},
  year = {2025},
  url = {https://huggingface.co/Abdalrahmankamel/matryoshka-arabert},
  note = {Hugging Face Model Repository}
}

🙏 Acknowledgments

Thanks to AUB MIND Lab for the AraBERTv02 base model (aubmindlab/bert-base-arabertv02).

📄 License

This model is released under the Apache 2.0 License.


Developed by Abdalrahman Kamel

Advancing Arabic NLP through innovative embedding techniques
