license:
tags:
- modernbert
- masked-lm
- arabic
datasets:
- oscar-corpus/OSCAR-2301
- wikimedia/wikipedia
language:
- ar
- en
model_type: modernbert
pipeline_tag: fill-mask
library_name: transformers
This model card presents gizadatateam/ModernAraBERT, a masked LM based on the ModernBERT architecture (rotary positional embeddings, alternating local–global attention), adapted for Arabic.
It was pre-trained on large-scale Arabic corpora such as OSCAR-2301 and supports long contexts of up to 8,192 tokens for fill-mask tasks in the Hugging Face Transformers ecosystem.
Model Details
Model Description
Arabic‑ModernBERT is built on the ModernBERT encoder with a Masked Language Modeling objective, delivering robust contextual embeddings for a variety of Arabic NLP tasks.
- Developed by: Giza Data Team (Ahmed Sami, Mohamed ElBehery, Mariam Ashraf, Reem Ayman, Ali Sameh, Ali Badawy)
- Model type: ModernBertForMaskedLM
- Language(s): Arabic (ar), English (en)
- License: .....
- Fine-tuned from: answerdotai/ModernBERT-base
Model Sources
- Repository: https://huggingface.co/gizadatateam/ModernAraBERT
- Base paper: Warner et al. (2024), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference.
Uses
Direct Use
Use this model directly with the fill-mask pipeline in Transformers for token-infilling tasks in Arabic and English.
Downstream Use
- Feature extraction for classification, named-entity recognition, question answering, and retrieval pipelines (see the embedding-extraction sketch below).
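A minimal embedding-extraction sketch for such downstream pipelines, assuming the checkpoint loads through AutoModel and that mean pooling over the last hidden state is an acceptable sentence representation (a common default, not a documented choice for this model):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # encoder only, without the MLM head
encoder.eval()

def embed(sentences):
    # Tokenize a batch of sentences with padding and truncation
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(input_ids=batch["input_ids"],
                         attention_mask=batch["attention_mask"]).last_hidden_state
    # Mean-pool over non-padding tokens to get one vector per sentence
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# "Arabic is a world language." / "Cairo is the capital of Egypt."
vectors = embed(["اللغة العربية لغة عالمية.", "القاهرة عاصمة مصر."])
print(vectors.shape)  # torch.Size([2, 768]) if the hidden size is 768 as listed under Technical Specifications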
Out‑of‑Scope Use
- Not intended for open-ended text generation beyond masked-token prediction, or for languages other than Arabic and English.
Bias, Risks, and Limitations
- Data bias: Trained on web‑scraped OSCAR and Wikipedia; may reflect misinformation or offensive content inherent in these sources.
- Dialectal coverage: Underrepresentation of certain Arabic dialects may lead to degraded performance on vernacular inputs.
Recommendations
Users should fine-tune on domain-specific or dialectal datasets to mitigate these biases and improve performance across Arabic varieties; a minimal fine-tuning sketch follows.
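A minimal continued-pretraining sketch with the standard Transformers Trainer and masked-language-modeling collator; the file name dialect_corpus.txt and all hyperparameters below are illustrative assumptions, not the authors' recipe:

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

model_id = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Hypothetical in-domain corpus: one sentence or document per line
dataset = load_dataset("text", data_files={"train": "dialect_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the standard 15% MLM probability
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="modernarabert-dialect",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
    logging_steps=100,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()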
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_id = "gizadatateam/ModernAraBERT"
# Load the tokenizer and masked-LM weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
# Build a fill-mask pipeline; [MASK] is the mask token
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# The prompt means roughly "The Arabic language is a global [MASK]."
print(fill_mask("اللغة العربية [MASK] عالمية."))
Requires Hugging Face Transformers ≥ 4.48.3.
Training Details
Training Data
- OSCAR-2301: ~1T tokens from Common Crawl (cc0-1.0)
- Wikimedia Wikipedia: Cleaned Arabic articles from latest dumps
- Additional Web Corpora: Various news and web‑crawl sources (undocumented)
Training Procedure
More information needed on exact hyperparameters and hardware setup.
Evaluation
No public evaluation benchmarks or scores have been released for this specific checkpoint.
Model Examination
No interpretability analyses have been published for this checkpoint.
Technical Specifications
Model Architecture and Objective
- Encoder‑only Transformer, 22 layers, 12 heads, hidden size 768, GeGLU activations, RoPE positional embeddings, alternating local/global attention.
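These figures can be checked against the published configuration; the attribute names below are the standard ModernBERT config fields in Transformers, and the expected values in the comments restate the specifications above:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("gizadatateam/ModernAraBERT")

# Expected per the specifications above: 22 layers, 12 heads, hidden size 768,
# and a long maximum context of up to 8,192 positions.
print(config.model_type)               # "modernbert"
print(config.num_hidden_layers)        # 22
print(config.num_attention_heads)      # 12
print(config.hidden_size)              # 768
print(config.max_position_embeddings)  # 8192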
Compute Infrastructure
- Compatible with CPU and GPU inference via Transformers.
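A short sketch of the usual device selection for pipeline inference (nothing checkpoint-specific is documented):

import torch
from transformers import pipeline

# Use the first GPU if available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
fill_mask = pipeline("fill-mask", model="gizadatateam/ModernAraBERT", device=device)
# top_k controls how many candidate fills are returned
print(fill_mask("اللغة العربية [MASK] عالمية.", top_k=3))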
Software
- Transformers v4.48.3, PyTorch backend.
Citation
If you use this model, please cite the model repository (https://huggingface.co/gizadatateam/ModernAraBERT) and the ModernBERT base paper listed above under Model Sources (Warner et al., 2024).
Model Card Authors
Giza Data Team (Ahmed Eldamaty, Mohamed Maher, Mohamed ElBehery, Mariam Ashraf)
Model Card Contact