metadata
language: or
license: apache-2.0
tags:
- odia
- language-model
- text-generation
- causal-lm
datasets:
- OdiaGenAIdata/fine_web2_odia_pt
- bigscience-data/roots_indic-or_indic_nlp_corpus
widget:
- text: ଓଡିଆ ଭାଷା
Odia Language Model (odia_tokenizers_test)
Model Description
This is a GPT-style causal language model trained for Odia text generation. Given an Odia prompt, it continues the text in a contextually appropriate manner.
Model Architecture
- Vocabulary Size: 50,000 tokens
- Context Length: 256 tokens
- Number of Layers: 24
- Number of Heads: 12
- Hidden Size: 768
- Parameters: ~354M
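The figures above can be sanity-checked with a rough parameter estimate. The sketch below assumes a standard GPT-2 style decoder: learned position embeddings, a fused QKV projection, an MLP with 4x expansion, biases on all linear layers, and an output head tied to the token embeddings. None of these details are confirmed by this card, and the function name `estimate_gpt_params` is hypothetical.

```python
def estimate_gpt_params(vocab_size, context_length, n_layers, hidden_size,
                        mlp_ratio=4, tied_embeddings=True):
    """Rough parameter count for a GPT-2 style decoder (assumed layout)."""
    d = hidden_size
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * d + context_length * d
    # Attention: fused QKV projection plus output projection, with biases
    attention = (3 * d * d + 3 * d) + (d * d + d)
    # MLP: up-projection and down-projection, with biases
    mlp = (d * mlp_ratio * d + mlp_ratio * d) + (mlp_ratio * d * d + d)
    # Two LayerNorms per block, each with a scale and a bias vector
    layer_norms = 2 * 2 * d
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d
    # An untied output head would add another vocab_size x d matrix
    lm_head = 0 if tied_embeddings else vocab_size * d
    return embeddings + n_layers * per_layer + final_norm + lm_head


total = estimate_gpt_params(vocab_size=50_000, context_length=256,
                            n_layers=24, hidden_size=768)
print(f"Estimated parameters: {total / 1e6:.1f}M")
```

If the estimate differs from the listed parameter count, the released architecture likely departs from the assumptions made here; the `config.json` in the repository is the authoritative source.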
Installation
First, install the required dependencies:
pip install torch sentencepiece huggingface-hub
Usage
Quick Start
Here's how to use the model for text generation:
import torch
import sentencepiece as sp
from huggingface_hub import hf_hub_download
# Step 1: Download and load the tokenizer
tokenizer_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="odia_tokenizer.model"
)
tokenizer = sp.SentencePieceProcessor()
tokenizer.load(tokenizer_path)
# Step 2: Download model files
model_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="pytorch_model.bin"
)
config_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="config.json"
)
# Step 3: Load the model. This requires the GPT model class definition;
# the model architecture code is available in the repository.
# Step 4: Generate text
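def generate_odia_text(prompt, max_length=100, model=None):
    # Encode the prompt into token IDs and add a batch dimension
    input_ids = tokenizer.encode_as_ids(prompt)
    input_tensor = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)
    if model is None:
        raise RuntimeError("Load the GPT model from the repository code first")
    # Assumes the repository's model class exposes generate(input_ids, max_length)
    with torch.no_grad():
        output = model.generate(input_tensor, max_length)
    # Decode the generated token IDs back into text
    return tokenizer.decode(output.squeeze(0).tolist())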
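Because the model class lives in the repository, the generation step above stays abstract. As an illustration of what such a step typically does, here is a minimal, framework-free sketch of autoregressive decoding with temperature and top-k sampling, cropped to the 256-token context. The callable `next_logits_fn` is a hypothetical stand-in for a forward pass that returns next-token logits as a list of floats, and the names and default values are assumptions rather than the repository's API.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=50, rng=random):
    """Sample one token ID from a list of logits using temperature and top-k."""
    # Keep only the top_k highest-scoring token IDs
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits, shifted by the max for numerical stability
    scaled = [logits[i] / temperature for i in top_ids]
    max_logit = max(scaled)
    weights = [math.exp(s - max_logit) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


def generate_ids(next_logits_fn, prompt_ids, max_new_tokens,
                 context_length=256, **sampling):
    """Extend prompt_ids one token at a time, feeding at most context_length tokens."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_length:]   # crop to the model's context window
        logits = next_logits_fn(window)  # hypothetical: logits for the next token
        ids.append(sample_next_token(logits, **sampling))
    return ids
```

With a real model, `next_logits_fn` would wrap the forward pass and return the last position's logits, and the resulting IDs would be passed to `tokenizer.decode`.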
Example Usage
# Example 1: Simple text generation
prompt = "ବର୍ଷା"
# generated_text = generate_odia_text(prompt, max_length=200)
# print(generated_text)
# Example 2: Encode and decode text
text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
encoded = tokenizer.encode_as_ids(text)
print(f"Encoded: {encoded}")
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")
Full Implementation Example
The full model architecture and implementation are available in the repository files. Refer to the model implementation there for complete, runnable code.
Training Details
Training Hyperparameters
- Max Iterations: 40,000
- Learning Rate: 3e-4 with cosine decay
- Batch Size: 16
- Gradient Accumulation Steps: 8
- Warmup Steps: 2,000
- Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- Mixed Precision: bfloat16/float16
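To make the schedule and batch figures above concrete, the following sketch computes the learning rate at a given step and the effective batch size. It assumes linear warmup, that "Max Iterations" counts optimizer steps, and a minimum learning rate of 10% of the peak; the card states none of these, so `MIN_LR` in particular is a hypothetical value. The token total is an upper bound that assumes every sequence fills the 256-token context.

```python
import math

MAX_ITERS = 40_000
WARMUP_STEPS = 2_000
PEAK_LR = 3e-4
MIN_LR = 3e-5  # assumption: floor at 10% of peak; not stated in the card


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at MAX_ITERS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_ITERS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_ITERS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


# Effective batch: micro-batch size times gradient accumulation steps
sequences_per_step = 16 * 8
# Upper bound, assuming every sequence fills the 256-token context
tokens_per_step = sequences_per_step * 256
total_tokens = tokens_per_step * MAX_ITERS
print(sequences_per_step, tokens_per_step, total_tokens)
```

Under these assumptions each optimizer step sees 128 sequences, or at most 32,768 tokens.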
Training Data
The model was trained on a combination of:
- OdiaGenAIdata/fine_web2_odia_pt - High-quality Odia web text
- bigscience-data/roots_indic-or_indic_nlp_corpus - Odia corpus from Indic NLP
Total training samples: ~3.8M texts
Limitations
- Maximum context length is 256 tokens
- Trained specifically on Odia text; it may not perform well on other languages
- May generate repetitive text for very long sequences
- The model requires the custom GPT architecture code to run
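One common heuristic for the repetition issue listed above is a repetition penalty applied to the next-token logits before sampling. The sketch below is a generic illustration, not part of this model's repository; the function name and the default penalty of 1.2 are assumptions.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated token IDs made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted
```

A penalty of 1.0 leaves the logits unchanged; larger values discourage repeats more strongly and may hurt fluency if set too high.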
Intended Use
This model is intended for:
- Odia text generation
- Odia language research
- Educational purposes
- Building Odia language applications
Citation
If you use this model, please cite:
@misc{odia_gpt_2024,
  title={Odia GPT Language Model},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}
Contact
For questions and feedback, please open an issue on the model repository.