Odia Language Model (odia_tokenizers_test)
Model Description
This is a GPT-based language model specifically trained for Odia language text generation. The model can generate coherent Odia text and continue prompts in a contextually appropriate manner.
Model Architecture
- Vocabulary Size: 50,000 tokens
- Context Length: 256 tokens
- Number of Layers: 24
- Number of Heads: 12
- Hidden Size: 768
- Parameters: ~354M
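As a sanity check on the figures above, the parameter count can be estimated from the listed dimensions. The sketch below is a back-of-the-envelope approximation that assumes a standard GPT-2-style block (attention plus a 4x-wide MLP), tied input and output embeddings, and ignores biases and LayerNorm weights; none of these details are confirmed by this card. The helper name `estimate_gpt_params` is hypothetical.

```python
def estimate_gpt_params(vocab_size, context_length, n_layer, d_model, tied_head=True):
    """Rough parameter count for a GPT-2-style decoder (biases and LayerNorms ignored)."""
    token_emb = vocab_size * d_model
    pos_emb = context_length * d_model
    # Per block: 4*d^2 for attention (Q, K, V, output projection)
    # plus 8*d^2 for an MLP with a 4x hidden expansion (assumed).
    per_block = 12 * d_model * d_model
    # An untied output head would add another vocab_size * d_model weights.
    head = 0 if tied_head else vocab_size * d_model
    return token_emb + pos_emb + n_layer * per_block + head


# Values taken from the list above.
listed = estimate_gpt_params(vocab_size=50_000, context_length=256, n_layer=24, d_model=768)
print(f"Estimate from the listed values: {listed / 1e6:.1f}M")

# For comparison only: the same formula with a hidden size of 1024.
wider = estimate_gpt_params(vocab_size=50_000, context_length=256, n_layer=24, d_model=1024)
print(f"Estimate with hidden size 1024:  {wider / 1e6:.1f}M")
```

Under these assumptions the listed values give roughly 208M parameters, while a hidden size of 1024 gives roughly 353M, which is closer to the stated ~354M. The card does not say which figure is authoritative, so treat the list above as approximate until checked against `config.json`.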
Installation
First, install the required dependencies:
pip install torch sentencepiece huggingface-hub
Usage
Quick Start
Here's how to use the model for text generation:
import torch
import sentencepiece as sp
from huggingface_hub import hf_hub_download

# Step 1: Download and load the tokenizer
tokenizer_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="odia_tokenizer.model"
)
tokenizer = sp.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Step 2: Download model files
model_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="pytorch_model.bin"
)
config_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="config.json"
)

# Step 3: Load the model architecture and weights
# First, download the model architecture file
architecture_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="model_architecture.py"
)
# Import the model classes
import sys
import importlib.util
spec = importlib.util.spec_from_file_location("model_architecture", architecture_path)
model_module = importlib.util.module_from_spec(spec)
sys.modules["model_architecture"] = model_module
spec.loader.exec_module(model_module)
# Import the classes we need
GPTConfig = model_module.GPTConfig
GPT = model_module.GPT
# Create model configuration
# Note: this uses the GPTConfig defaults; the downloaded config.json is not read here
config = GPTConfig()
# Initialize and load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT(config)
# Load the pretrained weights
checkpoint = torch.load(model_path, map_location=device)

# Check if the state_dict is nested and extract it if necessary
if isinstance(checkpoint, dict) and 'model' in checkpoint:
    state_dict = checkpoint['model']
else:
    state_dict = checkpoint

# Remove the 'model.' prefix from keys if present
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    if k.startswith('model.'):
        new_state_dict[k[6:]] = v  # Remove 'model.' prefix
    else:
        new_state_dict[k] = v
model.load_state_dict(new_state_dict)
model = model.to(device)
model.eval()
print(f"Model loaded successfully on {device}")

# Step 4: Generate text function
def generate_odia_text(prompt, max_length=100, temperature=0.8):
    # Encode the prompt as a batch of one sequence of token ids
    input_ids = tokenizer.encode_as_ids(prompt)
    input_tensor = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0).to(device)

    # Generate without tracking gradients
    with torch.no_grad():
        output = model.generate(input_tensor, max_length, temperature=temperature)

    # Decode the output token ids back to text
    generated_text = tokenizer.decode(output.squeeze().tolist())
    return generated_text
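The `temperature` argument controls how sharply the model favours its most likely next token. The implementation of `GPT.generate` is not shown in this card, so the sketch below only illustrates the common convention, assumed here, of dividing the logits by the temperature before applying softmax. The function `softmax_with_temperature` and the example logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
example_logits = [2.0, 1.0, 0.1]
for t in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring token and give more predictable text; higher temperatures flatten the distribution and give more varied output.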
Example Usage
# Example 1: Simple text generation
prompt = "ସେ କାଲି ସ୍କୁଲକୁ"
generated_text = generate_odia_text(prompt, max_length=200)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")
# Example 2: Encode and decode text
text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
encoded = tokenizer.encode_as_ids(text)
print(f"Original: {text}")
print(f"Encoded: {encoded}")
decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")
Training Details
Training Hyperparameters
- Max Iterations: 40,000
- Learning Rate: 3e-4 with cosine decay
- Batch Size: 16
- Gradient Accumulation Steps: 8
- Warmup Steps: 2,000
- Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- Mixed Precision: bfloat16/float16
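The warmup and cosine-decay schedule listed above can be sketched as follows. The peak learning rate, warmup steps, and iteration count come from the list; the minimum learning rate (`MIN_LR`) and the linear warmup shape are assumptions, since the card does not state them. The effective batch size assumes a single device and that "Batch Size" counts sequences.

```python
import math

MAX_LR = 3e-4          # from the list above
WARMUP_STEPS = 2_000   # from the list above
MAX_ITERS = 40_000     # from the list above
MIN_LR = 3e-5          # assumption: decay floor at 10% of the peak; not stated in the card


def lr_at(step):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_ITERS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_ITERS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


# Sequences per optimizer step: batch size times gradient accumulation steps.
effective_batch = 16 * 8
# Tokens per optimizer step, assuming every sequence fills the 256-token context.
tokens_per_step = effective_batch * 256

for step in (0, 1_000, 2_000, 21_000, 40_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
print(f"effective batch: {effective_batch} sequences, about {tokens_per_step} tokens per step")
```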
Training Data
The model was trained on a combination of:
- OdiaGenAIdata/fine_web2_odia_pt - High-quality Odia web text
- bigscience-data/roots_indic-or_indic_nlp_corpus - Odia corpus from Indic NLP
- Custom curated Odia dataset - Additional hand-curated Odia texts
Total training samples: more than 4M texts (approximate)
Limitations
- Maximum context length is 256 tokens
- Trained specifically on Odia text; it may not perform well on other languages
- May generate repetitive text for very long sequences
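Because of the 256-token context limit, long prompts may need to be shortened before generation. Whether `GPT.generate` crops its input internally is not stated in this card, so the following is a defensive sketch that keeps only the most recent tokens. The helper name `truncate_to_context` is hypothetical, and the token ids are plain integers standing in for tokenizer output.

```python
CONTEXT_LENGTH = 256  # from the Model Architecture section


def truncate_to_context(token_ids, context_length=CONTEXT_LENGTH):
    """Keep only the last `context_length` token ids; shorter inputs pass through unchanged."""
    return token_ids[-context_length:]


# Stand-in for an encoded prompt that is longer than the context window.
long_prompt_ids = list(range(300))
trimmed = truncate_to_context(long_prompt_ids)
print(len(long_prompt_ids), "->", len(trimmed))
```

Keeping the tail of the prompt rather than the head preserves the text closest to where generation continues.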
Intended Use
This model is intended for:
- Odia text generation
- Odia language research
- Educational purposes
- Building Odia language applications
Citation
If you use this model, please cite:
@misc{odia_gpt_2024,
title={Odia GPT Language Model},
author={Your Name},
year={2024},
publisher={HuggingFace}
}
Contact
For questions and feedback, please open an issue on the model repository.