license: 

tags:
  - modernbert
  - masked-lm
  - arabic
datasets:
  - oscar-corpus/OSCAR-2301
  - wikimedia/wikipedia
language:
  - ar
  - en
model_type: modernbert
pipeline_tag: fill-mask
library_name: transformers

This model card presents gizadatateam/ModernAraBERT, a masked language model based on the ModernBERT architecture and adapted for Arabic, with rotary position embeddings (RoPE) and alternating local–global attention.
It was pre‑trained on large‑scale Arabic corpora (e.g., OSCAR‑2301) and supports long contexts (up to 8,192 tokens) for fill‑mask tasks in the Hugging Face Transformers ecosystem.

Model Details

Model Description

Arabic‑ModernBERT is built on the ModernBERT encoder with a Masked Language Modeling objective, delivering robust contextual embeddings for a variety of Arabic NLP tasks.

  • Developed by: Giza Data Team (Ahmed Sami, Mohamed ElBehery, Mariam Ashraf, Reem Ayman, Ali Sameh, Ali Badawy)
  • Model type: ModernBertForMaskedLM
  • Language(s): Arabic (ar), English (en)
  • License: .....
  • Fine‑tuned from: AnswerDotAI/ModernBERT (base)

Model Sources

  • Repository: https://huggingface.co/gizadatateam/ModernAraBERT
  • Base paper: Warner et al. (2024), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Uses

Direct Use

Use this model directly with the fill-mask pipeline in Transformers for token infilling tasks in Arabic and English.

Downstream Use

  • Feature extraction for classification, named‑entity recognition, question answering, and retrieval pipelines (see the sketch below).
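A minimal sketch of feature extraction is shown below; mean pooling over the encoder's last hidden states is an illustrative choice, not a documented recipe for this checkpoint.

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # encoder only, without the MLM head

sentences = ["النص الأول", "النص الثاني"]  # example Arabic inputs
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**batch)

# Mean-pool token embeddings over non-padding positions to get sentence vectors.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)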

Out‑of‑Scope Use

  • Not intended for open‑ended text generation beyond masked‑token prediction, nor for languages other than Arabic and English.

Bias, Risks, and Limitations

  • Data bias: Trained on web‑scraped OSCAR and Wikipedia; may reflect misinformation or offensive content inherent in these sources.
  • Dialectal coverage: Underrepresentation of certain Arabic dialects may lead to degraded performance on vernacular inputs.

Recommendations

Users should fine‑tune on domain‑specific or dialectal datasets to mitigate biases and improve performance across varied Arabic variants.
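
A minimal sketch of continued masked‑LM training on a domain corpus is given below; the dataset file, hyperparameters, and output directory are placeholders rather than settings used by the authors.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# "domain_corpus.txt" is a placeholder for your own Arabic text file.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="modernarabert-domain",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()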

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the tokenizer and masked-LM head from the Hub.
model_id = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build a fill-mask pipeline and predict the masked token in an Arabic sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("اللغة العربية [MASK] عالمية."))

Requires Hugging Face Transformers ≥ 4.48.3.

Training Details

Training Data

  • OSCAR‑2301: ~1T tokens from Common Crawl (CC0‑1.0 license)
  • Wikimedia Wikipedia: cleaned Arabic articles from the latest dumps
  • Additional Web Corpora: Various news and web‑crawl sources (undocumented)

Training Procedure

More information needed on exact hyperparameters and hardware setup.

Evaluation

No public evaluation benchmarks or scores have been released for this specific checkpoint.

Model Examination

No interpretability analyses have been published for this checkpoint.

Technical Specifications

Model Architecture and Objective

  • Encoder‑only Transformer with a masked language modeling objective: 22 layers, 12 attention heads, hidden size 768, GeGLU activations, RoPE positional embeddings, alternating local/global attention; ≈211M parameters (F32 safetensors).
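
These settings can be checked against the released configuration; the snippet below assumes the standard ModernBERT config fields in Transformers, and the commented values reflect the figures stated above.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("gizadatateam/ModernAraBERT")
print(config.num_hidden_layers)        # expected: 22
print(config.num_attention_heads)      # expected: 12
print(config.hidden_size)              # expected: 768
print(config.max_position_embeddings)  # expected: 8192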

Compute Infrastructure

  • Compatible with CPU and GPU inference via Transformers.

Software

  • Transformers v4.48.3, PyTorch backend.

Citation

If you use this model, please cite the model repository and the ModernBERT base paper (Warner et al., 2024).


Model Card Authors

Giza Data Team (Ahmed Eldamaty, Mohamed Maher, Mohamed ElBehery, Mariam Ashraf)

Model Card Contact
