🎯 Model Overview
This model enhances AraBERTv02 with Matryoshka Representation Learning and LoRA adaptation to produce Arabic sentence embeddings. It supports multiple embedding dimensions (8, 64, 128, 256) from a single model, letting you trade off accuracy against speed and storage.
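The core Matryoshka idea is that the first *k* components of the full embedding already form a usable embedding on their own, so the smaller dimensions come from truncation rather than from separate models. A minimal sketch of that truncation step in plain PyTorch (independent of this repo's wrapper API):

```python
import torch
import torch.nn.functional as F

def truncate_embedding(full_emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components of a Matryoshka embedding and re-normalize."""
    truncated = full_emb[..., :dim]
    return F.normalize(truncated, p=2, dim=-1)

full = torch.randn(256)  # stand-in for a 256-dim sentence embedding
small = truncate_embedding(full, 64)
print(small.shape)  # torch.Size([64])
```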
✨ Key Features
| Feature | Description |
|---|---|
| 🔄 Multi-Dimensional | Single model supports 4 different embedding sizes (8, 64, 128, 256) |
| 🚀 High Performance | Outperforms base AraBERT across all dimensions |
| 📊 Arabic NLI Optimized | Trained specifically on Arabic Natural Language Inference |
| ⚡ Efficient Inference | Smaller dimensions for faster processing |
| 🎯 Triplet Loss Training | Enhanced semantic understanding through triplet learning |
🛠️ Quick Start
Installation
```bash
pip install git+https://github.com/Abdalrahman54/matryoshka-wrapper.git
```
Basic Usage
```python
from matryoshka_wrapper import load_model
from torch.nn.functional import cosine_similarity

# Load the model at the desired embedding dimension
repo_name = "Abdalrahmankamel/matryoshka-arabert"
model, tokenizer = load_model(repo_name, dim="256")

# Example texts
text1 = "هذا المنتج كان مخيبًا للآمال."  # "This product was disappointing."
text2 = "هذه البضاعة رائعة!"  # "These goods are great!"

# Generate embeddings
emb1 = model.get_embedding(text1, tokenizer, dim="256").squeeze()
emb2 = model.get_embedding(text2, tokenizer, dim="256").squeeze()

# Calculate cosine similarity between the two sentence embeddings
similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
print(f"🔍 Cosine Similarity: {similarity:.4f}")
```
📊 More Examples
Triplet Example
```python
# Triplet data example
anchor = "الطفل يلعب في الحديقة"  # "The child is playing in the garden"
positive = "ولد صغير يلهو في البستان"  # "A young boy is playing in the orchard" (similar sentence)
negative = "السيارة تسير في الشارع"  # "The car is driving down the street" (different sentence)

# Generate embeddings
anchor_emb = model.get_embedding(anchor, tokenizer, dim="256").squeeze()
positive_emb = model.get_embedding(positive, tokenizer, dim="256").squeeze()
negative_emb = model.get_embedding(negative, tokenizer, dim="256").squeeze()

# Calculate similarities
sim_positive = cosine_similarity(anchor_emb.unsqueeze(0), positive_emb.unsqueeze(0)).item()
sim_negative = cosine_similarity(anchor_emb.unsqueeze(0), negative_emb.unsqueeze(0)).item()

print("🔍 Triplet Results:")
print(f"📊 Anchor ↔ Positive: {sim_positive:.4f}")
print(f"📊 Anchor ↔ Negative: {sim_negative:.4f}")
print(f"📈 Margin: {sim_positive - sim_negative:.4f}")

if sim_positive > sim_negative:
    print("✅ Triplet Success!")
else:
    print("❌ Triplet Failed!")
```
Dimension Comparison
```python
# Compare similarity across all supported dimensions
text1 = "اطفال يمرحون سوياً بالكرة في المساحات الخضراء"  # "Children having fun together with a ball in the green spaces"
text2 = "أطفال يلعبون كرة القدم على العشب"  # "Children playing football on the grass"

dimensions = [8, 64, 128, 256]
print("🔍 Multi-Dimensional Similarity Comparison")
print("=" * 50)

for dim in dimensions:
    model, tokenizer = load_model(repo_name, dim=str(dim))
    emb1 = model.get_embedding(text1, tokenizer, dim=str(dim)).squeeze()
    emb2 = model.get_embedding(text2, tokenizer, dim=str(dim)).squeeze()
    similarity = cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()
    print(f"📐 Dim {dim:>3}: Similarity = {similarity:.4f} | Shape = {emb1.shape}")
```
🎯 Use Cases
✅ Recommended Uses
- Semantic Search: Find similar Arabic documents or passages (see the sketch after this list)
- Information Retrieval: Enhance search systems with semantic understanding
- Content Recommendation: Suggest related Arabic content
- Document Clustering: Group similar Arabic texts
- Question Answering: Retrieve relevant contexts for Arabic QA
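A minimal semantic-search sketch built on the same `load_model` / `get_embedding` API shown in Quick Start; the mini-corpus and query below are illustrative, not from the repo:

```python
import torch
from torch.nn.functional import cosine_similarity
from matryoshka_wrapper import load_model

model, tokenizer = load_model("Abdalrahmankamel/matryoshka-arabert", dim="256")

# Illustrative mini-corpus; in practice this would be your document collection
corpus = [
    "الطقس اليوم مشمس وجميل",  # "The weather today is sunny and beautiful"
    "أحب قراءة الكتب العلمية",  # "I love reading science books"
    "كرة القدم رياضة شعبية",  # "Football is a popular sport"
]
query = "ما هي حالة الجو؟"  # "What is the weather like?"

# Embed the corpus once, then embed each incoming query
corpus_embs = torch.stack(
    [model.get_embedding(t, tokenizer, dim="256").squeeze() for t in corpus]
)
query_emb = model.get_embedding(query, tokenizer, dim="256").squeeze()

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_emb.unsqueeze(0), corpus_embs)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```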
❌ Limitations
- Non-Arabic Text: Optimized specifically for Arabic language
- Classification Tasks: Not directly trained for classification
- Generative Tasks: Not suitable for text generation or translation
- Dialectal Variations: May underperform on specific Arabic dialects
🏗️ Model Architecture
| Component | Details |
|---|---|
| Base Model | aubmindlab/bert-base-arabertv02 |
| Enhancement | LoRA (Low-Rank Adaptation) |
| Training Objective | Triplet Loss |
| Embedding Dimensions | 8, 64, 128, 256 |
| Language | Arabic |
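For readers who want to reproduce the adaptation step, here is a hedged sketch of attaching LoRA adapters to the base encoder with the `peft` library. The rank, alpha, dropout, and target modules below are illustrative assumptions, not the repo's recorded settings:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

# Hypothetical LoRA settings; the actual training configuration may differ
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # BERT attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```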
🔬 Training Details
Training Configuration
| Parameter | Value |
|---|---|
| Loss Function | Triplet Loss |
| Adaptation Method | LoRA |
| Precision | FP16 Mixed Precision |
| Epochs | 3 |
| Batch Size | 32 |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 |
| Training Time | ~7 hours |
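The triplet objective pushes the anchor closer to the positive than to the negative by at least a margin: loss = max(0, d(a, p) - d(a, n) + m). A minimal sketch with PyTorch's built-in loss; the margin value here is an assumption, not the documented training setting:

```python
import torch
import torch.nn as nn

# Margin of 1.0 is illustrative; the value used in training is not documented here
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

anchor = torch.randn(32, 256, requires_grad=True)    # batch of anchor embeddings
positive = torch.randn(32, 256, requires_grad=True)  # batch of positive embeddings
negative = torch.randn(32, 256, requires_grad=True)  # batch of negative embeddings

loss = triplet_loss(anchor, positive, negative)
loss.backward()
print(loss.item())
```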
Dataset
- Source: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
- Type: Arabic Natural Language Inference triplets
- Structure: Anchor-Positive-Negative sentence pairs
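A hedged sketch of loading the triplets with the `datasets` library; the split and column names are assumptions based on the anchor-positive-negative structure described above:

```python
from datasets import load_dataset

# Split and column names assumed; check the dataset card for the actual layout
ds = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet", split="train")

example = ds[0]
print(example["anchor"], example["positive"], example["negative"], sep="\n")
```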
📈 Evaluation Metrics
- Cosine Similarity: Primary similarity measure
- Mean Average Precision (MAP): Retrieval performance
- Recall@K: Top-K retrieval accuracy
- Triplet Accuracy: Correct triplet ranking percentage
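Triplet accuracy is the fraction of triplets in which the anchor is closer to the positive than to the negative. A minimal sketch over batches of embeddings:

```python
import torch
from torch.nn.functional import cosine_similarity

def triplet_accuracy(anchor: torch.Tensor, positive: torch.Tensor,
                     negative: torch.Tensor) -> float:
    """Fraction of triplets ranked correctly (anchor closer to positive than negative)."""
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return (sim_pos > sim_neg).float().mean().item()

# Toy batch of 32 random 256-dim embeddings, just to show the call
a, p, n = (torch.randn(32, 256) for _ in range(3))
print(f"Triplet accuracy: {triplet_accuracy(a, p, n):.2%}")
```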
⚠️ Bias and Limitations
Potential Biases
- Inherits biases from AraBERT base model and training datasets
- May favor Modern Standard Arabic over dialectal variants
- Performance varies across different Arabic domains and regions
Recommendations
- Evaluate on target domain before deployment
- Use human oversight for critical applications
- Test with diverse Arabic text sources
🌍 Environmental Impact
- Hardware: NVIDIA A100 GPU
- Training Duration: ~7 hours
- Carbon Footprint: Calculated using ML CO2 Impact Calculator
📚 Citation
If you use this model in your research, please cite:
```bibtex
@misc{kamel2025arabert-matryoshka,
  author = {Abdalrahman Kamel},
  title  = {AraBERT Matryoshka: Multi-Dimensional Arabic Sentence Embeddings with Triplet Loss},
  year   = {2025},
  url    = {https://huggingface.co/Abdalrahmankamel/matryoshka-arabert},
  note   = {Hugging Face Model Repository}
}
```
🙏 Acknowledgments
- AraBERT Team: For the excellent base model (aubmindlab/bert-base-arabertv02)
- Sentence Transformers: For the robust training framework
- Matryoshka Representation Learning: For the innovative nested embedding approach
- Arabic NLI Dataset: Omartificial-Intelligence-Space for the training data
📄 License
This model is released under the Apache 2.0 License.
Developed by Abdalrahman Kamel
Advancing Arabic NLP through innovative embedding techniques