Odia Language Model (odia_tokenizers_test)

Model Description

This is a GPT-based language model specifically trained for Odia language text generation. The model can generate coherent Odia text and continue prompts in a contextually appropriate manner.

Model Architecture

  • Vocabulary Size: 50,000 tokens
  • Context Length: 256 tokens
  • Number of Layers: 24
  • Number of Heads: 12
  • Hidden Size: 768
  • Parameters: ~354M
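
For reference, here is a minimal sketch of how these dimensions could map onto the GPTConfig used in the Quick Start below, assuming nanoGPT-style field names (the authoritative definition lives in model_architecture.py in the repository):

from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50000  # tokenizer vocabulary size
    block_size: int = 256    # maximum context length in tokens
    n_layer: int = 24        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # hidden (embedding) size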

Installation

First, install the required dependencies:

pip install torch sentencepiece huggingface-hub

Usage

Quick Start

Here's how to use the model for text generation:

import torch
import sentencepiece as sp
from huggingface_hub import hf_hub_download

# Step 1: Download and load the tokenizer
tokenizer_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="odia_tokenizer.model"
)

tokenizer = sp.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Step 2: Download model files
model_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="pytorch_model.bin"
)

config_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="config.json"
)

# Step 3: Load the model architecture and weights
# First, download the model architecture file
architecture_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="model_architecture.py"
)

# Import the model classes
import sys
import importlib.util
spec = importlib.util.spec_from_file_location("model_architecture", architecture_path)
model_module = importlib.util.module_from_spec(spec)
sys.modules["model_architecture"] = model_module
spec.loader.exec_module(model_module)

# Import the classes we need
GPTConfig = model_module.GPTConfig
GPT = model_module.GPT

# Create model configuration
config = GPTConfig()

# Initialize and load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT(config)

# Load the pretrained weights
checkpoint = torch.load(model_path, map_location=device)

# Check if the state_dict is nested and extract it if necessary
if isinstance(checkpoint, dict) and 'model' in checkpoint:
    state_dict = checkpoint['model']
else:
    state_dict = checkpoint

# Remove the 'model.' prefix from keys if present
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    if k.startswith('model.'):
        new_state_dict[k[6:]] = v  # Remove 'model.' prefix
    else:
        new_state_dict[k] = v

model.load_state_dict(new_state_dict)

model = model.to(device)
model.eval()
print(f"Model loaded successfully on {device}")

# Step 4: Generate text function
def generate_odia_text(prompt, max_length=100, temperature=0.8):
    # Encode the prompt into token IDs
    input_ids = tokenizer.encode_as_ids(prompt)
    input_tensor = torch.tensor(input_ids).unsqueeze(0).to(device)

    # Generate; whether max_length counts new tokens or the total
    # sequence depends on how generate() is defined in model_architecture.py
    with torch.no_grad():
        output = model.generate(input_tensor, max_length, temperature=temperature)

    # Decode the generated token IDs back to text
    generated_text = tokenizer.decode(output.squeeze().tolist())
    return generated_text

Example Usage

# Example 1: Simple text generation
prompt = "ସେ କାଲି ସ୍କୁଲକୁ"
generated_text = generate_odia_text(prompt, max_length=200)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")

# Example 2: Encode and decode text
text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
encoded = tokenizer.encode_as_ids(text)
print(f"Original: {text}")
print(f"Encoded: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")

Training Details

Training Hyperparameters

  • Max Iterations: 40,000
  • Learning Rate: 3e-4 with cosine decay
  • Batch Size: 16
  • Gradient Accumulation Steps: 8
  • Warmup Steps: 2,000
  • Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
  • Mixed Precision: bfloat16/float16
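
The exact training loop is not published with the model, but as an illustration, here is a hedged sketch of how this optimizer and schedule could be set up in PyTorch (names such as warmup_steps and lr_at are illustrative, not taken from the repository):

import math
import torch

max_iters = 40_000
warmup_steps = 2_000
peak_lr = 3e-4

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_at(step):
    # linear warmup to peak_lr, then cosine decay to zero
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_iters - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))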

Training Data

The model was trained on a combination of:

  1. OdiaGenAIdata/fine_web2_odia_pt - High-quality Odia web text
  2. bigscience-data/roots_indic-or_indic_nlp_corpus - Odia corpus from Indic NLP
  3. Custom curated Odia dataset - Additional hand-curated Odia texts

Total training samples: approximately 4 million texts
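
To inspect the public datasets yourself, something like the following should work with the Hugging Face datasets library (the split name is an assumption; check each dataset card, and note that some ROOTS datasets require accepting terms on the Hub):

from datasets import load_dataset

# The split name below is an assumption; verify on the dataset card
web_odia = load_dataset("OdiaGenAIdata/fine_web2_odia_pt", split="train")
print(web_odia[0])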

Limitations

  • Maximum context length is 256 tokens; longer prompts must be truncated to fit (see the sketch after this list)
  • Trained specifically on Odia text; the model may not perform well on other languages
  • May generate repetitive text for very long sequences
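
Because of the 256-token limit, long prompts should be truncated before generation. A minimal sketch that keeps the most recent tokens and reserves room for the continuation (truncate_prompt is an illustrative helper, not part of the repository):

def truncate_prompt(prompt, max_new_tokens=100, context_length=256):
    # Keep only the most recent tokens so that the prompt plus the
    # generated continuation fits inside the model's context window
    ids = tokenizer.encode_as_ids(prompt)
    budget = context_length - max_new_tokens
    return ids[-budget:]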

Intended Use

This model is intended for:

  • Odia text generation
  • Odia language research
  • Educational purposes
  • Building Odia language applications

Citation

If you use this model, please cite:

@misc{odia_gpt_2024,
  title={Odia GPT Language Model},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}

Contact

For questions and feedback, please open an issue on the model repository.
