Model Card for FaseehGPT

Model Details

  • Model Name: FaseehGPT
  • Model Type: Decoder-only Transformer (GPT-style)
  • Repository: alphatechlogics/FaseehGPT
  • Version: 1.1
  • Builder: Alphatechlogics 🔗 GitHub | 🤗 Hugging Face | 💼 LinkedIn
  • Developer: Ahsan Umar 🔗 GitHub | 🤗 Hugging Face | 💼 LinkedIn
  • Date: July 10, 2025
  • License: Apache 2.0
  • Framework: PyTorch, Hugging Face Transformers
  • Language: Arabic
  • Intended Use: Text generation and language modeling for Arabic text

FaseehGPT is a GPT-style language model for Arabic text, trained on a subset of two Arabic corpora to generate coherent, contextually relevant text. It uses a pre-trained Arabic tokenizer (asafaya/bert-base-arabic) and is designed to be trainable in resource-constrained environments such as Google Colab (free GPU). The model was trained for 20 epochs, with checkpoints and sample generations saved during training.


Model Architecture

  • Architecture: Decoder-only transformer with multi-head self-attention and feed-forward layers

  • Parameters:

    • Vocabulary Size: ~32,000 (from asafaya/bert-base-arabic tokenizer)
    • Embedding Dimension: 512
    • Number of Layers: 12
    • Number of Attention Heads: 8
    • Feed-forward Dimension: 2048
    • Total Parameters: ~70.7 million
  • Configuration:

    • Maximum Sequence Length: 512
    • Dropout Rate: 0.1
    • Activation Function: GELU
  • Weight Initialization: Normal distribution (mean = 0, std = 0.02)

  • Special Features: Supports top-k and top-p sampling; weight tying between input and output embeddings
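
The repository's model code is not reproduced in this card, but a minimal sketch of one decoder block with the dimensions listed above (embedding dimension 512, 8 attention heads, feed-forward dimension 2048, GELU, dropout 0.1) might look as follows. The pre-norm layout and all class and variable names are illustrative assumptions, not the actual implementation:

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style block with the dimensions listed above (illustrative only)."""
    def __init__(self, embed_dim=512, num_heads=8, ffn_dim=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        seq_len = x.size(1)
        # Causal mask: True marks future positions a query may not attend to
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x

Twelve such blocks, stacked on top of the token and positional embeddings and followed by the language-modeling head, make up the full model described above.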


Training Details

Datasets

  • arbml/Arabic_News: 7,114,814 news article texts

  • arbml/Arabic_Literature: 1,592,629 literary texts

  • Subset Used: 50,000 texts (randomly sampled)

    • Training Set: 45,000 (90%)
    • Validation Set: 5,000 (10%)
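
The sampling script itself is not included in this card. A minimal sketch of how such a 50,000-text subset and 90/10 split could be reproduced with the Hugging Face datasets library is shown below; the seed, split names, and the assumption that both corpora share a compatible schema are illustrative:

from datasets import load_dataset, concatenate_datasets

# Load both corpora from the Hugging Face Hub
news = load_dataset("arbml/Arabic_News", split="train")
literature = load_dataset("arbml/Arabic_Literature", split="train")

# Pool the texts, shuffle, and keep a 50,000-example subset
combined = concatenate_datasets([news, literature]).shuffle(seed=42)
subset = combined.select(range(50_000))

# 90/10 train/validation split (45,000 / 5,000)
split = subset.train_test_split(test_size=0.1, seed=42)
train_texts, val_texts = split["train"], split["test"]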

Training Configuration

  • Epochs: 20
  • Learning Rate: 3e-4 (Karpathy constant)
  • Optimizer: AdamW (weight decay = 0.01)
  • Scheduler: Linear warmup (10% of steps) with decay
  • Batch Size: Effective 16 (4 gradient accumulation steps)
  • Hardware: Kaggle (P100)
  • Training Duration: 8.18 hours
  • Checkpoint: Saved at epoch 20
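
The optimizer and schedule described above can be sketched roughly as follows. This is not the actual training script: the model and dataloader are placeholders, and the assumption that the model's forward pass returns a .loss is illustrative.

import torch
from transformers import get_linear_schedule_with_warmup

GRAD_ACCUM_STEPS = 4        # 4 accumulation steps -> effective batch size 16

def build_optimizer_and_scheduler(model, num_training_steps):
    """AdamW + linear warmup/decay with the hyperparameters listed above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),   # linear warmup over 10% of steps
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

def train_epoch(model, train_loader, optimizer, scheduler):
    """One epoch with gradient accumulation (assumes the forward pass returns a .loss)."""
    model.train()
    for step, batch in enumerate(train_loader):
        loss = model(**batch).loss / GRAD_ACCUM_STEPS
        loss.backward()
        if (step + 1) % GRAD_ACCUM_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()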

Sample Generated Text (Epoch 20)

Prompt 1: "اللغة العربية" ("the Arabic language") Output:

اللغة العربية اقرب ويح الي كما ذلك هذه البيان شعره قاله الاستاذر من وتج معهم فمنليل وصوله له الفرقة التيهااهها الخطاب ماه مسلمفن ، تقولبة وحياة –زة الشخصية مسلم شبه منذ

Prompt 2: "كان يا مكان في قديم الزمان" ("once upon a time") Output:

كان يا مكان في قديم الزمان الانسان الانسان بعض لا انر لقد الانسان ذلك انلاركارك عرض عرض كروي.رح نشا المطلوب وعمل كنكتب الاردني فبدي السابق كان " يريد " صورة ولا وانما " التي النعيم الصحيح بمع للنفط ". يريد قصر توفيق ديكتوتو قد في ثمانية جسد ". الصحيفة انه الاسلام البلد التي " لا من ثالثة شبه كانت بصفته في الوعيدها انبر التي في ما من ، رحب مهمة مز انه ليبر بسرعةالية ، الارجح ما عن به انقلاب في

Analysis: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.


Usage

FaseehGPT can be used to generate Arabic text from a prompt. Example code:

from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Generate text
prompt = "السلام عليكم"  # "peace be upon you"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Parameters for Generation

  • max_new_tokens: Max tokens to generate (e.g., 100)
  • temperature: Controls randomness (default: 1.0)
  • top_k: Limits sampling to top-k tokens (default: 50)
  • top_p: Nucleus sampling threshold (default: 0.9)

Expected Output: Arabic text that continues the given prompt, depending on training quality and settings.
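
As a rough guide, lowering temperature and top_p makes continuations more conservative, while raising them increases diversity. Continuing from the snippet above (the specific values here are illustrative, not tuned recommendations):

# More conservative continuation: lower temperature, tighter nucleus
conservative = model.generate(input_ids, max_new_tokens=100, temperature=0.7, top_k=40, top_p=0.85)

# More diverse continuation: higher temperature, wider sampling pool
diverse = model.generate(input_ids, max_new_tokens=100, temperature=1.2, top_k=100, top_p=0.95)

print(tokenizer.decode(conservative[0], skip_special_tokens=True))
print(tokenizer.decode(diverse[0], skip_special_tokens=True))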


Dataset Description

  • Source: Hugging Face Datasets

  • Used Datasets:

    • arbml/Arabic_News: News across diverse topics with formal Arabic
    • arbml/Arabic_Literature: Novels and poetry, providing rich language variety
  • Total Texts: 8,707,443 (full); 50,000 used for training

Preprocessing

  • Tokenized using asafaya/bert-base-arabic
  • Long texts split into overlapping chunks (stride = max_seq_len // 2)
  • Special tokens: <SOS>, <EOS>, <PAD>, <UNK>
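
The chunking step can be sketched as follows, assuming max_seq_len = 512 and a stride of max_seq_len // 2 as noted above; the helper is illustrative and omits the <SOS>/<EOS>/<PAD> handling used during training:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
max_seq_len = 512
stride = max_seq_len // 2    # 50% overlap between consecutive chunks

def chunk_text(text):
    """Split a long text into overlapping chunks of at most max_seq_len tokens."""
    ids = tokenizer(text, add_special_tokens=False).input_ids
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(ids[start:start + max_seq_len])
        if start + max_seq_len >= len(ids):
            break
    return chunks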

Evaluation

  • Metrics: Cross-entropy loss (training and validation)
  • Status: Loss metrics unavailable due to incomplete logging
  • Observations: Generated samples show partial learning; some incoherence remains

Recommendations

  • Extract loss from checkpoint model_checkpoint_epoch_20.pt
  • Use verbose logging in future training
  • Add evaluation metrics: perplexity, BLEU (see the perplexity sketch after this list)
  • Try smaller models (e.g., embed_dim=256, num_layers=6) for faster Colab testing
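
Perplexity can be derived directly from the cross-entropy loss already used in training, since perplexity = exp(mean token-level cross-entropy). A minimal sketch, assuming the model's forward pass returns logits over the vocabulary and ignoring padding for brevity:

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, dataloader, device="cuda"):
    """Corpus perplexity = exp(mean token-level cross-entropy)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        logits = model(input_ids).logits                 # assumes the forward pass returns .logits
        shift_logits = logits[:, :-1, :].contiguous()    # position t predicts token t+1
        shift_labels = input_ids[:, 1:].contiguous()
        loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += shift_labels.numel()
    return math.exp(total_loss / total_tokens)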

Limitations

  • Generated Text Quality: Inconsistent coherence suggests undertraining
  • Resource Constraints: Small subset used due to Colab GPU limits
  • Language Specificity: Only Arabic supported; others untested
  • Training Duration: 8.18 hours insufficient for full dataset

Ethical Considerations

  • Bias: May reflect cultural or topical biases from source data
  • Usage: For research/non-commercial use; validate outputs
  • Privacy: Datasets are public; comply with Hugging Face policies

How to Contribute

  • Repo: alphatechlogics/FaseehGPT
  • Issues: Report bugs or suggest features via issue tracker
  • Training: Resume on full dataset or better hardware
  • Evaluation: Add scripts for BLEU, perplexity, etc.

Citation

@misc{faseehgpt2025,
  title     = {FaseehGPT: An Arabic Language Model},
  author    = {Ahsan Umar and Rohma},
  year      = {2025},
  url       = {https://huggingface.co/alphatechlogics/FaseehGPT}
}