Model Card for FaseehGPT
Model Details
- Model Name: FaseehGPT
- Model Type: Decoder-only Transformer (GPT-style)
- Repository: alphatechlogics/FaseehGPT
- Version: 1.1
- Builder: Alphatechlogics (GitHub | Hugging Face | LinkedIn)
- Developer: Ahsan Umar (GitHub | Hugging Face | LinkedIn)
- Date: July 10, 2025
- License: Apache 2.0
- Framework: PyTorch, Hugging Face Transformers
- Language: Arabic
- Intended Use: Text generation and language modeling for Arabic text
FaseehGPT is a GPT-style language model for Arabic text processing, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It uses a pre-trained Arabic tokenizer (asafaya/bert-base-arabic) and is optimized for resource-constrained environments such as Google Colab (free GPU). The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.
Model Architecture
Architecture: Decoder-only transformer with multi-head self-attention and feed-forward layers
Parameters:
- Vocabulary Size: ~32,000 (from the asafaya/bert-base-arabic tokenizer)
- Embedding Dimension: 512
- Number of Layers: 12
- Number of Attention Heads: 8
- Feed-forward Dimension: 2048
- Total Parameters: ~70.7 million
Configuration:
- Maximum Sequence Length: 512
- Dropout Rate: 0.1
- Activation Function: GELU
Weight Initialization: Normal distribution (mean = 0, std = 0.02)
Special Features: Supports top-k and top-p sampling; weight tying between input and output embeddings
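For orientation, the sketch below mirrors this configuration in plain PyTorch. Class and attribute names are illustrative, not the repository's actual modules; it is meant only to convey the layer layout, and the real implementation in the repository's custom modeling code will differ in detail.

```python
# Illustrative sketch of the published configuration in plain PyTorch.
# Class and attribute names are hypothetical, not the repository's actual modules.
import torch
import torch.nn as nn

EMBED_DIM, N_HEADS, FF_DIM, N_LAYERS = 512, 8, 2048, 12
VOCAB_SIZE, MAX_SEQ_LEN, DROPOUT = 32_000, 512, 0.1

class DecoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(EMBED_DIM, N_HEADS, dropout=DROPOUT, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(EMBED_DIM, FF_DIM), nn.GELU(),
            nn.Linear(FF_DIM, EMBED_DIM), nn.Dropout(DROPOUT),
        )
        self.ln1, self.ln2 = nn.LayerNorm(EMBED_DIM), nn.LayerNorm(EMBED_DIM)

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        return x + self.ff(self.ln2(x))

class FaseehGPTSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = nn.Embedding(MAX_SEQ_LEN, EMBED_DIM)
        self.blocks = nn.ModuleList(DecoderBlock() for _ in range(N_LAYERS))
        self.ln_f = nn.LayerNorm(EMBED_DIM)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE, bias=False)
        self.head.weight = self.tok_emb.weight          # weight tying, as described above
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        # Normal(mean=0, std=0.02) initialization, as described above
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                  # next-token logits
```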
Training Details
Datasets
- arbml/Arabic_News: 7,114,814 news article texts
- arbml/Arabic_Literature: 1,592,629 literary texts
Subset Used: 50,000 texts (randomly sampled)
- Training Set: 45,000 (90%)
- Validation Set: 5,000 (10%)
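A rough sketch of how such a subset could be assembled with the Hugging Face datasets library is shown below; the split name, column layout, and random seed are assumptions rather than the project's actual preprocessing script.

```python
# Illustrative only: assumes both datasets expose a "train" split with compatible columns.
from datasets import load_dataset, concatenate_datasets

news = load_dataset("arbml/Arabic_News", split="train")
literature = load_dataset("arbml/Arabic_Literature", split="train")

combined = concatenate_datasets([news, literature]).shuffle(seed=42)
subset = combined.select(range(50_000))                  # 50,000 randomly sampled texts
split = subset.train_test_split(test_size=0.1, seed=42)  # 45,000 train / 5,000 validation
train_texts, val_texts = split["train"], split["test"]
```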
Training Configuration
- Epochs: 20
- Learning Rate: 3e-4 (Karpathy constant)
- Optimizer: AdamW (weight decay = 0.01)
- Scheduler: Linear warmup (10% of steps) with decay
- Batch Size: Effective 16 (4 gradient accumulation steps)
- Hardware: Kaggle (NVIDIA Tesla P100 GPU)
- Training Duration: 8.18 hours
- Checkpoint: Saved at epoch 20
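The skeleton below illustrates this configuration (AdamW with weight decay 0.01, 10% linear warmup, and 4-step gradient accumulation). The tiny stand-in model and random data only keep the sketch self-contained; the actual training script and the 45,000-text split are not reproduced here.

```python
# Training-loop skeleton for the configuration above (AdamW, 10% linear warmup,
# gradient accumulation). The tiny model and random data are stand-ins so the
# skeleton runs on its own; the real run used FaseehGPT and the 45,000-text split.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

EPOCHS, ACCUM_STEPS, LR, VOCAB = 20, 4, 3e-4, 32_000

model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))  # stand-in model
data = torch.randint(0, VOCAB, (64, 128))                             # stand-in token batches
train_loader = DataLoader(TensorDataset(data), batch_size=4)          # per-step batch of 4

optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=0.01)
total_updates = (len(train_loader) // ACCUM_STEPS) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_updates),   # 10% of steps: linear warmup
    num_training_steps=total_updates,            # then linear decay
)

for epoch in range(EPOCHS):
    for step, (input_ids,) in enumerate(train_loader):
        logits = model(input_ids)                                    # (batch, seq, vocab)
        # Next-token prediction: shift labels left by one position
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                               input_ids[:, 1:].reshape(-1))
        (loss / ACCUM_STEPS).backward()          # accumulate over 4 micro-batches
        if (step + 1) % ACCUM_STEPS == 0:        # effective batch size = 4 x 4 = 16
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    torch.save(model.state_dict(), f"model_checkpoint_epoch_{epoch + 1}.pt")
```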
Sample Generated Text (Epoch 20)
Prompt 1: "اللغة العربية" ("the Arabic language")
Output:
ุงููุบุฉ ุงูุนุฑุจูุฉ ุงูุฑุจ ููุญ ุงูู ูู ุง ุฐูู ูุฐู ุงูุจูุงู ุดุนุฑู ูุงูู ุงูุงุณุชุงุฐุฑ ู ู ูุชุฌ ู ุนูู ูู ูููู ูุตููู ูู ุงููุฑูุฉ ุงูุชููุงุงููุง ุงูุฎุทุงุจ ู ุงู ู ุณูู ูู ุ ุชูููุจุฉ ูุญูุงุฉ โุฒุฉ ุงูุดุฎุตูุฉ ู ุณูู ุดุจู ู ูุฐ
Prompt 2: "كان يا مكان في قديم الزمان" ("once upon a time, long ago")
Output:
ูุงู ูุง ู ูุงู ูู ูุฏูู ุงูุฒู ุงู ุงูุงูุณุงู ุงูุงูุณุงู ุจุนุถ ูุง ุงูุฑ ููุฏ ุงูุงูุณุงู ุฐูู ุงููุงุฑูุงุฑู ุนุฑุถ ุนุฑุถ ูุฑูู.ุฑุญ ูุดุง ุงูู ุทููุจ ูุนู ู ูููุชุจ ุงูุงุฑุฏูู ูุจุฏู ุงูุณุงุจู ูุงู " ูุฑูุฏ " ุตูุฑุฉ ููุง ูุงูู ุง " ุงูุชู ุงููุนูู ุงูุตุญูุญ ุจู ุน ููููุท ". ูุฑูุฏ ูุตุฑ ุชูููู ุฏููุชูุชู ูุฏ ูู ุซู ุงููุฉ ุฌุณุฏ ". ุงูุตุญููุฉ ุงูู ุงูุงุณูุงู ุงูุจูุฏ ุงูุชู " ูุง ู ู ุซุงูุซุฉ ุดุจู ูุงูุช ุจุตูุชู ูู ุงููุนูุฏูุง ุงูุจุฑ ุงูุชู ูู ู ุง ู ู ุ ุฑุญุจ ู ูู ุฉ ู ุฒ ุงูู ููุจุฑ ุจุณุฑุนุฉุงููุฉ ุ ุงูุงุฑุฌุญ ู ุง ุนู ุจู ุงูููุงุจ ูู
Analysis: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.
Usage
FaseehGPT can be used to generate Arabic text from a prompt. Example code:
```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Generate text
prompt = "السلام عليكم"  # "Peace be upon you"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
Parameters for Generation
- max_new_tokens: Maximum number of tokens to generate (e.g., 100)
- temperature: Controls randomness (default: 1.0)
- top_k: Limits sampling to the top-k most likely tokens (default: 50)
- top_p: Nucleus sampling probability threshold (default: 0.9)
Expected Output: Arabic text continuing the given prompt; quality depends on training and sampling settings.
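For intuition about these sampling parameters, the sketch below shows one common way temperature, top-k, and top-p (nucleus) filtering are applied to next-token logits before sampling. It is a generic illustration of the technique, not FaseehGPT's actual generate implementation.

```python
# Generic temperature / top-k / top-p sampling over next-token logits; illustrative only.
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    logits = logits / temperature                        # higher temperature = flatter distribution
    topk_vals, topk_idx = torch.topk(logits, top_k)      # keep only the top-k candidates
    probs = F.softmax(topk_vals, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Nucleus: keep the smallest set of tokens whose cumulative probability reaches top_p
    keep = (cumulative - sorted_probs) < top_p
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum()                 # renormalize what survives
    choice = torch.multinomial(filtered, num_samples=1)  # sample one token
    return topk_idx[sorted_idx[choice]]

# Toy example with random logits over a ~32,000-token vocabulary
print(sample_next_token(torch.randn(32_000)).item())
```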
Dataset Description
Source: Hugging Face Datasets
Used Datasets:
- arbml/Arabic_News: News across diverse topics, written in formal Arabic
- arbml/Arabic_Literature: Novels and poetry, providing rich language variety
Total Texts: 8,707,443 (full); 50,000 used for training
Preprocessing
- Tokenized using asafaya/bert-base-arabic
- Long texts split into overlapping chunks (stride = max_seq_len // 2)
- Special tokens: <SOS>, <EOS>, <PAD>, <UNK>
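The overlapping-chunk step can be pictured as follows; the function name and the toy example are illustrative only.

```python
# Illustrative: split a long token sequence into overlapping chunks,
# with stride = max_seq_len // 2 as noted above.
def chunk_token_ids(token_ids, max_seq_len=512):
    stride = max_seq_len // 2
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_seq_len])
        if start + max_seq_len >= len(token_ids):
            break                                   # final chunk already reaches the end
    return chunks

# A 1,200-token document becomes four overlapping windows
print([len(c) for c in chunk_token_ids(list(range(1200)))])  # [512, 512, 512, 432]
```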
Evaluation
- Metrics: Cross-entropy loss (training and validation)
- Status: Loss metrics unavailable due to incomplete logging
- Observations: Generated samples show partial learning; some incoherence remains
Recommendations
- Extract loss from checkpoint model_checkpoint_epoch_20.pt
- Use verbose logging in future training runs
- Add evaluation metrics: perplexity, BLEU (see the sketch below)
- Try smaller models (e.g., embed_dim=256, num_layers=6) for faster Colab testing
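As a starting point for the perplexity recommendation above, the sketch below reads a stored loss from the epoch-20 checkpoint and converts it to perplexity (the exponential of the cross-entropy loss); the checkpoint's key layout is an assumption.

```python
# Perplexity = exp(mean cross-entropy loss).
# Assumes the checkpoint dictionary stores a "val_loss" entry; the real layout may differ.
import math
import torch

checkpoint = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
val_loss = checkpoint.get("val_loss") if isinstance(checkpoint, dict) else None
if val_loss is not None:
    print(f"validation loss: {val_loss:.4f}, perplexity: {math.exp(val_loss):.2f}")
else:
    print("No stored loss found; recompute cross-entropy over the validation set instead.")
```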
Limitations
- Generated Text Quality: Inconsistent coherence suggests undertraining
- Resource Constraints: Small subset used due to Colab GPU limits
- Language Specificity: Only Arabic supported; others untested
- Training Duration: 8.18 hours is insufficient for the full dataset
Ethical Considerations
- Bias: May reflect cultural or topical biases from source data
- Usage: For research/non-commercial use; validate outputs
- Privacy: Datasets are public; comply with Hugging Face policies
How to Contribute
- Repo: alphatechlogics/FaseehGPT
- Issues: Report bugs or suggest features via issue tracker
- Training: Resume on full dataset or better hardware
- Evaluation: Add scripts for BLEU, perplexity, etc.
Citation
@misc{faseehgpt2025,
title = {FaseehGPT: An Arabic Language Model},
author = {Ahsan Umar and Rohma},
year = {2025},
url = {https://huggingface.co/alphatechlogics/FaseehGPT}
}