Model Card for FaseehGPT
Model Details
- Model Name: FaseehGPT
- Model Type: Decoder-only Transformer (GPT-style)
- Repository: alphatechlogics/FaseehGPT
- Version: 1.1
- Builder: Alphatechlogics (GitHub | Hugging Face | LinkedIn)
- Developer: Ahsan Umar (GitHub | Hugging Face | LinkedIn)
- Date: July 10, 2025
- License: Apache 2.0
- Framework: PyTorch, Hugging Face Transformers
- Language: Arabic
- Intended Use: Text generation and language modeling for Arabic text
FaseehGPT is a GPT-style language model for Arabic text processing, trained on a subset of Arabic datasets to generate coherent and contextually relevant text. It uses a pre-trained Arabic tokenizer (asafaya/bert-base-arabic) and is optimized for resource-constrained environments such as Google Colab (free GPU). The model was trained for 20 epochs, with checkpoints and sample generations saved along the way.
Model Architecture
Architecture: Decoder-only transformer with multi-head self-attention and feed-forward layers
Parameters:
- Vocabulary Size: ~32,000 (from the asafaya/bert-base-arabic tokenizer)
- Embedding Dimension: 512
- Number of Layers: 12
- Number of Attention Heads: 8
- Feed-forward Dimension: 2048
- Total Parameters: ~70.7 million
Configuration:
- Maximum Sequence Length: 512
- Dropout Rate: 0.1
- Activation Function: GELU
Weight Initialization: Normal distribution (mean = 0, std = 0.02)
Special Features: Supports top-k and top-p sampling; weight tying between input and output embeddings
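For orientation, the sketch below mirrors this configuration in plain PyTorch. Class and attribute names are illustrative, not the repository's actual modules; it is meant only to convey the layer layout, and the real implementation in the repository's custom modeling code will differ in detail.

```python
# Illustrative sketch of the published configuration in plain PyTorch.
# Class and attribute names are hypothetical, not the repository's actual modules.
import torch
import torch.nn as nn

EMBED_DIM, N_HEADS, FF_DIM, N_LAYERS = 512, 8, 2048, 12
VOCAB_SIZE, MAX_SEQ_LEN, DROPOUT = 32_000, 512, 0.1

class DecoderBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(EMBED_DIM, N_HEADS, dropout=DROPOUT, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(EMBED_DIM, FF_DIM), nn.GELU(),
            nn.Linear(FF_DIM, EMBED_DIM), nn.Dropout(DROPOUT),
        )
        self.ln1, self.ln2 = nn.LayerNorm(EMBED_DIM), nn.LayerNorm(EMBED_DIM)

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        return x + self.ff(self.ln2(x))

class FaseehGPTSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.pos_emb = nn.Embedding(MAX_SEQ_LEN, EMBED_DIM)
        self.blocks = nn.ModuleList(DecoderBlock() for _ in range(N_LAYERS))
        self.ln_f = nn.LayerNorm(EMBED_DIM)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE, bias=False)
        self.head.weight = self.tok_emb.weight          # weight tying, as described above
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        # Normal(mean=0, std=0.02) initialization, as described above
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))                  # next-token logits
```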
Training Details
Datasets
- arbml/Arabic_News: 7,114,814 news article texts
- arbml/Arabic_Literature: 1,592,629 literary texts
Subset Used: 50,000 texts (randomly sampled)
- Training Set: 45,000 (90%)
- Validation Set: 5,000 (10%)
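A rough sketch of how such a subset could be assembled with the Hugging Face datasets library is shown below; the split name, column layout, and random seed are assumptions rather than the project's actual preprocessing script.

```python
# Illustrative only: assumes both datasets expose a "train" split with compatible columns.
from datasets import load_dataset, concatenate_datasets

news = load_dataset("arbml/Arabic_News", split="train")
literature = load_dataset("arbml/Arabic_Literature", split="train")

combined = concatenate_datasets([news, literature]).shuffle(seed=42)
subset = combined.select(range(50_000))                  # 50,000 randomly sampled texts
split = subset.train_test_split(test_size=0.1, seed=42)  # 45,000 train / 5,000 validation
train_texts, val_texts = split["train"], split["test"]
```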
Training Configuration
- Epochs: 20
- Learning Rate: 3e-4 (Karpathy constant)
- Optimizer: AdamW (weight decay = 0.01)
- Scheduler: Linear warmup (10% of steps) with decay
- Batch Size: Effective 16 (4 gradient accumulation steps)
- Hardware: Kaggle (NVIDIA Tesla P100 GPU)
- Training Duration: 8.18 hours
- Checkpoint: Saved at epoch 20
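The skeleton below illustrates this configuration (AdamW with weight decay 0.01, 10% linear warmup, and 4-step gradient accumulation). The tiny stand-in model and random data only keep the sketch self-contained; the actual training script and the 45,000-text split are not reproduced here.

```python
# Training-loop skeleton for the configuration above (AdamW, 10% linear warmup,
# gradient accumulation). The tiny model and random data are stand-ins so the
# skeleton runs on its own; the real run used FaseehGPT and the 45,000-text split.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

EPOCHS, ACCUM_STEPS, LR, VOCAB = 20, 4, 3e-4, 32_000

model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))  # stand-in model
data = torch.randint(0, VOCAB, (64, 128))                             # stand-in token batches
train_loader = DataLoader(TensorDataset(data), batch_size=4)          # per-step batch of 4

optimizer = torch.optim.AdamW(model.parameters(), lr=LR, weight_decay=0.01)
total_updates = (len(train_loader) // ACCUM_STEPS) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_updates),   # 10% of steps: linear warmup
    num_training_steps=total_updates,            # then linear decay
)

for epoch in range(EPOCHS):
    for step, (input_ids,) in enumerate(train_loader):
        logits = model(input_ids)                                    # (batch, seq, vocab)
        # Next-token prediction: shift labels left by one position
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                               input_ids[:, 1:].reshape(-1))
        (loss / ACCUM_STEPS).backward()          # accumulate over 4 micro-batches
        if (step + 1) % ACCUM_STEPS == 0:        # effective batch size = 4 x 4 = 16
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    torch.save(model.state_dict(), f"model_checkpoint_epoch_{epoch + 1}.pt")
```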
Sample Generated Text (Epoch 20)
Prompt 1: "اللغة العربية" ("the Arabic language")
Output:
ุงููุบุฉ ุงูุนุฑุจูุฉ ุงูุฑุจ ููุญ ุงูู ูู ุง ุฐูู ูุฐู ุงูุจูุงู ุดุนุฑู ูุงูู ุงูุงุณุชุงุฐุฑ ู ู ูุชุฌ ู ุนูู ูู ูููู ูุตููู ูู ุงููุฑูุฉ ุงูุชููุงุงููุง ุงูุฎุทุงุจ ู ุงู ู ุณูู ูู ุ ุชูููุจุฉ ูุญูุงุฉ โุฒุฉ ุงูุดุฎุตูุฉ ู ุณูู ุดุจู ู ูุฐ
Prompt 2: "كان يا مكان في قديم الزمان" ("once upon a time, long ago")
Output:
ูุงู ูุง ู ูุงู ูู ูุฏูู ุงูุฒู ุงู ุงูุงูุณุงู ุงูุงูุณุงู ุจุนุถ ูุง ุงูุฑ ููุฏ ุงูุงูุณุงู ุฐูู ุงููุงุฑูุงุฑู ุนุฑุถ ุนุฑุถ ูุฑูู.ุฑุญ ูุดุง ุงูู ุทููุจ ูุนู ู ูููุชุจ ุงูุงุฑุฏูู ูุจุฏู ุงูุณุงุจู ูุงู " ูุฑูุฏ " ุตูุฑุฉ ููุง ูุงูู ุง " ุงูุชู ุงููุนูู ุงูุตุญูุญ ุจู ุน ููููุท ". ูุฑูุฏ ูุตุฑ ุชูููู ุฏููุชูุชู ูุฏ ูู ุซู ุงููุฉ ุฌุณุฏ ". ุงูุตุญููุฉ ุงูู ุงูุงุณูุงู ุงูุจูุฏ ุงูุชู " ูุง ู ู ุซุงูุซุฉ ุดุจู ูุงูุช ุจุตูุชู ูู ุงููุนูุฏูุง ุงูุจุฑ ุงูุชู ูู ู ุง ู ู ุ ุฑุญุจ ู ูู ุฉ ู ุฒ ุงูู ููุจุฑ ุจุณุฑุนุฉุงููุฉ ุ ุงูุงุฑุฌุญ ู ุง ุนู ุจู ุงูููุงุจ ูู
Analysis: The generated text shows some coherence but includes grammatical and semantic inconsistencies. The model may benefit from further training or fine-tuning.
Usage
FaseehGPT can be used to generate Arabic text from a prompt. Example code:
```python
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("alphatechlogics/FaseehGPT", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("alphatechlogics/FaseehGPT")

# Generate text
prompt = "السلام عليكم"  # "Peace be upon you"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=100, temperature=1.0, top_k=50, top_p=0.9)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
Parameters for Generation
- max_new_tokens: Maximum number of tokens to generate (e.g., 100)
- temperature: Controls randomness (default: 1.0)
- top_k: Limits sampling to the top-k most likely tokens (default: 50)
- top_p: Nucleus sampling probability threshold (default: 0.9)
Expected Output: Arabic text continuing the given prompt; quality depends on training and sampling settings.
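For intuition about these sampling parameters, the sketch below shows one common way temperature, top-k, and top-p (nucleus) filtering are applied to next-token logits before sampling. It is a generic illustration of the technique, not FaseehGPT's actual generate implementation.

```python
# Generic temperature / top-k / top-p sampling over next-token logits; illustrative only.
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    logits = logits / temperature                        # higher temperature = flatter distribution
    topk_vals, topk_idx = torch.topk(logits, top_k)      # keep only the top-k candidates
    probs = F.softmax(topk_vals, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Nucleus: keep the smallest set of tokens whose cumulative probability reaches top_p
    keep = (cumulative - sorted_probs) < top_p
    filtered = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    filtered = filtered / filtered.sum()                 # renormalize what survives
    choice = torch.multinomial(filtered, num_samples=1)  # sample one token
    return topk_idx[sorted_idx[choice]]

# Toy example with random logits over a ~32,000-token vocabulary
print(sample_next_token(torch.randn(32_000)).item())
```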
Dataset Description
Source: Hugging Face Datasets
Used Datasets:
- arbml/Arabic_News: News across diverse topics, written in formal Arabic
- arbml/Arabic_Literature: Novels and poetry, providing rich language variety
Total Texts: 8,707,443 (full); 50,000 used for training
Preprocessing
- Tokenized using asafaya/bert-base-arabic
- Long texts split into overlapping chunks (stride = max_seq_len // 2)
- Special tokens: <SOS>, <EOS>, <PAD>, <UNK>
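The overlapping-chunk step can be pictured as follows; the function name and the toy example are illustrative only.

```python
# Illustrative: split a long token sequence into overlapping chunks,
# with stride = max_seq_len // 2 as noted above.
def chunk_token_ids(token_ids, max_seq_len=512):
    stride = max_seq_len // 2
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_seq_len])
        if start + max_seq_len >= len(token_ids):
            break                                   # final chunk already reaches the end
    return chunks

# A 1,200-token document becomes four overlapping windows
print([len(c) for c in chunk_token_ids(list(range(1200)))])  # [512, 512, 512, 432]
```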
Evaluation
- Metrics: Cross-entropy loss (training and validation)
- Status: Loss metrics unavailable due to incomplete logging
- Observations: Generated samples show partial learning; some incoherence remains
Recommendations
- Extract loss from checkpoint model_checkpoint_epoch_20.pt
- Use verbose logging in future training runs
- Add evaluation metrics: perplexity, BLEU (see the sketch below)
- Try smaller models (e.g., embed_dim=256, num_layers=6) for faster Colab testing
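As a starting point for the perplexity recommendation above, the sketch below reads a stored loss from the epoch-20 checkpoint and converts it to perplexity (the exponential of the cross-entropy loss); the checkpoint's key layout is an assumption.

```python
# Perplexity = exp(mean cross-entropy loss).
# Assumes the checkpoint dictionary stores a "val_loss" entry; the real layout may differ.
import math
import torch

checkpoint = torch.load("model_checkpoint_epoch_20.pt", map_location="cpu")
val_loss = checkpoint.get("val_loss") if isinstance(checkpoint, dict) else None
if val_loss is not None:
    print(f"validation loss: {val_loss:.4f}, perplexity: {math.exp(val_loss):.2f}")
else:
    print("No stored loss found; recompute cross-entropy over the validation set instead.")
```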
Limitations
- Generated Text Quality: Inconsistent coherence suggests undertraining
- Resource Constraints: Small subset used due to Colab GPU limits
- Language Specificity: Only Arabic supported; others untested
- Training Duration: 8.18 hours is insufficient for the full dataset
Ethical Considerations
- Bias: May reflect cultural or topical biases from source data
- Usage: For research/non-commercial use; validate outputs
- Privacy: Datasets are public; comply with Hugging Face policies
How to Contribute
- Repo: alphatechlogics/FaseehGPT
- Issues: Report bugs or suggest features via issue tracker
- Training: Resume on full dataset or better hardware
- Evaluation: Add scripts for BLEU, perplexity, etc.
Citation
@misc{faseehgpt2025,
title = {FaseehGPT: An Arabic Language Model},
author = {Ahsan Umar and Rohma},
year = {2025},
url = {https://huggingface.co/alphatechlogics/FaseehGPT}
}