license:
tags:
- modernbert
- masked-lm
- arabic
datasets:
- oscar-corpus/OSCAR-2301
- wikimedia/wikipedia
language:
- ar
- en
model_type: modernbert
pipeline_tag: fill-mask
library_name: transformers
This model card presents gizadatateam/ModernAraBERT, a masked LM based on the ModernBERT architecture (rotary positional embeddings, alternating local–global attention), adapted for Arabic.
It was pre-trained on large-scale Arabic corpora such as OSCAR-2301 and supports long contexts of up to 8,192 tokens for fill-mask tasks in the Hugging Face Transformers ecosystem.
Model Details
Model Description
Arabic‑ModernBERT is built on the ModernBERT encoder with a Masked Language Modeling objective, delivering robust contextual embeddings for a variety of Arabic NLP tasks.
- Developed by: Giza Data Team (Ahmed Sami, Mohamed ElBehery, Mariam Ashraf, Reem Ayman, Ali Sameh, Ali Badawy)
- Model type: ModernBertForMaskedLM
- Language(s): Arabic (ar), English (en)
- License: .....
- Fine-tuned from: answerdotai/ModernBERT-base
Model Sources
- Repository: https://huggingface.co/gizadatateam/ModernAraBERT
- Base paper: Warner et al. (2024), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference.
Uses
Direct Use
Use this model directly with the fill-mask pipeline in Transformers for token-infilling tasks in Arabic and English.
Downstream Use
- Feature extraction for classification, named-entity recognition, question answering, and retrieval pipelines (see the embedding-extraction sketch below).
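A minimal embedding-extraction sketch for such downstream pipelines, assuming the checkpoint loads through AutoModel and that mean pooling over the last hidden state is an acceptable sentence representation (a common default, not a documented choice for this model):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # encoder only, without the MLM head
encoder.eval()

def embed(sentences):
    # Tokenize a batch of sentences with padding and truncation
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(input_ids=batch["input_ids"],
                         attention_mask=batch["attention_mask"]).last_hidden_state
    # Mean-pool over non-padding tokens to get one vector per sentence
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# "Arabic is a world language." / "Cairo is the capital of Egypt."
vectors = embed(["اللغة العربية لغة عالمية.", "القاهرة عاصمة مصر."])
print(vectors.shape)  # torch.Size([2, 768]) if the hidden size is 768 as listed under Technical Specifications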
Out‑of‑Scope Use
- Not intended for open-ended text generation beyond masked-token prediction, or for languages other than Arabic and English.
Bias, Risks, and Limitations
- Data bias: Trained on web‑scraped OSCAR and Wikipedia; may reflect misinformation or offensive content inherent in these sources.
- Dialectal coverage: Underrepresentation of certain Arabic dialects may lead to degraded performance on vernacular inputs.
Recommendations
Users should fine-tune on domain-specific or dialectal datasets to mitigate these biases and improve performance across Arabic varieties; a minimal fine-tuning sketch follows.
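A minimal continued-pretraining sketch with the standard Transformers Trainer and masked-language-modeling collator; the file name dialect_corpus.txt and all hyperparameters below are illustrative assumptions, not the authors' recipe:

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_id = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Hypothetical in-domain corpus: one sentence or document per line
dataset = load_dataset("text", data_files={"train": "dialect_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the standard 15% MLM probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="modernarabert-dialect",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=100,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()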
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "gizadatateam/ModernAraBERT"
# Load the tokenizer and masked-LM weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# Build a fill-mask pipeline; [MASK] is the mask token
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# The prompt means roughly "The Arabic language is a global [MASK]."
print(fill_mask("اللغة العربية [MASK] عالمية."))
Requires Hugging Face Transformers ≥ 4.48.3.
Training Details
Training Data
- OSCAR-2301: ~1T tokens from Common Crawl (cc0-1.0)
- Wikimedia Wikipedia: Cleaned Arabic articles from latest dumps
- Additional Web Corpora: Various news and web‑crawl sources (undocumented)
Training Procedure
More information needed on exact hyperparameters and hardware setup.
Evaluation
No public evaluation benchmarks or scores have been released for this specific checkpoint.
Model Examination
No interpretability analyses have been published for this checkpoint.
Technical Specifications
Model Architecture and Objective
- Encoder‑only Transformer, 22 layers, 12 heads, hidden size 768, GeGLU activations, RoPE positional embeddings, alternating local/global attention.
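These figures can be checked against the published configuration; the attribute names below are the standard ModernBERT config fields in Transformers, and the expected values in the comments restate the specifications above:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("gizadatateam/ModernAraBERT")

# Expected per the specifications above: 22 layers, 12 heads, hidden size 768,
# and a long maximum context of up to 8,192 positions.
print(config.model_type)               # "modernbert"
print(config.num_hidden_layers)        # 22
print(config.num_attention_heads)      # 12
print(config.hidden_size)              # 768
print(config.max_position_embeddings)  # 8192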
Compute Infrastructure
- Compatible with CPU and GPU inference via Transformers.
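A short sketch of the usual device selection for pipeline inference (nothing checkpoint-specific is documented):

import torch
from transformers import pipeline

# Use the first GPU if available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
fill_mask = pipeline("fill-mask", model="gizadatateam/ModernAraBERT", device=device)
# top_k controls how many candidate fills are returned
print(fill_mask("اللغة العربية [MASK] عالمية.", top_k=3))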
Software
- Transformers v4.48.3, PyTorch backend.
Citation
If you use this model, please cite the model repository (https://huggingface.co/gizadatateam/ModernAraBERT) and the ModernBERT base paper listed above under Model Sources (Warner et al., 2024).
Model Card Authors
Giza Data Team (Ahmed Eldamaty, Mohamed Maher, Mohamed ElBehery, Mariam Ashraf)
Model Card Contact