This model has been pushed to the Hub using the PyTorchModelHubMixin integration.

ALIF Base 100M

ALIF Base 100M is an Urdu generative language model from the ALIF الف series (a Final Year Project at Habib University), developed by Orature AI.

Model Details

Developed by: Orature AI (S.M Ali Naqvi, Zainab Haider, Haya Fatima, Ali M Asad, Hammad Sajid)
Supervised by: Dr. Abdul Samad (Habib University)
Model type: Decoder-only Transformer, GPT-like
Variant: ALIF-Base-100M
Language(s) (NLP): Urdu (ur)
License: Apache 2.0
Architecture: Transformer (GPT-Based)
Framework: PyTorch
Tokenizer: Custom SentencePiece tokenizer
Hyperparameters:
- Vocabulary Size: 32000
- Embedding Size: 768
- Attention Heads: 12
- Layers: 12
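
As a rough sanity check on the "100M" in the model name, the hyperparameters above can be turned into a back-of-the-envelope parameter estimate. This is only a sketch: the 4x MLP expansion ratio is an assumption borrowed from common GPT-style designs and is not stated in this card, and the estimate ignores biases, layer norms, and position embeddings (the context length is not listed here).

```python
# Values taken from the hyperparameter list above.
VOCAB_SIZE = 32000
EMBED_DIM = 768
NUM_HEADS = 12
NUM_LAYERS = 12

# Assumption (not confirmed by this model card): MLP hidden size is 4x the embedding size.
MLP_RATIO = 4


def estimate_parameters(vocab_size, embed_dim, num_layers, mlp_ratio=MLP_RATIO):
    """Rough weight-matrix count; ignores biases, layer norms, and position embeddings."""
    token_embedding = vocab_size * embed_dim
    attention = 4 * embed_dim * embed_dim        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * embed_dim * embed_dim  # up- and down-projection
    return token_embedding + num_layers * (attention + mlp)


head_dim = EMBED_DIM // NUM_HEADS
total = estimate_parameters(VOCAB_SIZE, EMBED_DIM, NUM_LAYERS)
print(f"Per-head dimension: {head_dim}")
print(f"Estimated parameters: {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands near 110M, which is in line with the model's name. The exact count depends on details in modeling_gpt.py, such as whether the input embedding and output projection share weights.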

How to Get Started with the Model

First, download the modeling_gpt.py file from the model repository and place it next to your script so that it can be imported. You can then use the following code to generate text with the model:

from modeling_gpt import GPTLanguageModel
from transformers import AutoTokenizer
import torch

model_name = "orature/ALIF-Base-100M"
model = GPTLanguageModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# For text generation
prompt_urdu = "ایک دفعہ کا ذکر ہے کہ " # "Once upon a time, "
inputs = tokenizer.encode(prompt_urdu)
inputs_tensor = torch.tensor(inputs).unsqueeze(0)  # Add batch dimension

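# Generate text (inference only, so disable dropout and gradient tracking)
model.eval()
with torch.no_grad():
    outputs = model.generate(inputs_tensor, max_new_tokens=64, temperature=0.7)

# torch.as_tensor accepts either a tensor or a list of token ids
generated_ids = torch.as_tensor(outputs).squeeze().tolist()
generated_text = tokenizer.decode(generated_ids)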

print(f"Prompt: {prompt_urdu}")
print(f"Generated Text: {generated_text}")
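
To give a feel for what the temperature argument in the call above typically controls, here is a minimal, dependency-free sketch of temperature-scaled sampling. It illustrates the general technique only; whether GPTLanguageModel.generate implements sampling exactly this way is an assumption, and the function names and toy logits below are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical scores for a three-token vocabulary.
toy_logits = [2.0, 1.0, 0.1]
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharp = softmax_with_temperature(toy_logits, temperature=0.7)

print("T=1.0:", [round(p, 3) for p in probs_default])
print("T=0.7:", [round(p, 3) for p in probs_sharp])
print("Sampled index:", sample_token(toy_logits, temperature=0.7))
```

A temperature below 1 concentrates probability on the highest-scoring tokens, which tends to make output more conservative, while values above 1 flatten the distribution and increase variety.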

Model Description

ALIF Base 100M is designed to generate coherent and contextually relevant Urdu text. It leverages a custom Urdu tokenizer trained on the ALIF-Urdu-Corpus and was pretrained on a large corpus of diverse Urdu text.

Key Features:

- Optimized for Urdu language nuances.
- As a base model, provides a strong foundation for further fine-tuning.
- Capable of generating next tokens in a sequence, making it suitable for various text generation tasks.
- Part of a series aiming to provide efficient and accessible small language models (SLMs) for Urdu.

Intended Uses & Limitations

Intended Uses:

- Text Generation: Creative writing, content generation, story completion in Urdu.
- Research: Base for further research in Urdu NLP, low-resource language modeling.
- Fine-tuning: Can be fine-tuned for specific downstream tasks like sentiment analysis, summarization, or domain-specific chatbots in Urdu.
- Educational Purposes: Understanding SLM behavior for Urdu.

Limitations:

- The model is primarily trained on Urdu and may not perform well on other languages or code-switched text unless specifically designed for it (e.g., an Ur-En variant).
- As a base generative model, it may generate plausible-sounding but incorrect or nonsensical information (hallucinations).
- The model may reflect biases present in the training data. The ALIF-Urdu-Corpus was curated from diverse sources, but biases (e.g., societal, gender, regional) may still exist.
- Performance on highly specific or technical domains may be limited without further fine-tuning.
- The model does not have real-time knowledge and its information is limited to its training data.
- Safety: While efforts are made to curate data, the model might generate offensive, harmful, or inappropriate content. Users should implement appropriate safeguards for downstream applications.

Out-of-Scope Uses:

- Generating high-stakes advice (medical, legal, financial) without human oversight.
- Impersonation or generating misleading information.
- Applications that could lead to harm or discrimination.
- Complex scientific, technical, mathematical, or legal reasoning without further fine-tuning.
- Any use that violates ethical guidelines or legal standards.