Odia Language Model (odia_tokenizers_test)

Model Description

This is a GPT-based language model specifically trained for Odia language text generation. The model can generate coherent Odia text and continue prompts in a contextually appropriate manner.

Model Architecture

  • Vocabulary Size: 50,000 tokens
  • Context Length: 256 tokens
  • Number of Layers: 24
  • Number of Heads: 12
  • Hidden Size: 768
  • Parameters: ~354M
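
For reference, here is a minimal sketch of how these dimensions could map onto the GPTConfig used in the Quick Start below, assuming nanoGPT-style field names (the authoritative definition lives in model_architecture.py in the repository):

from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50000  # tokenizer vocabulary size
    block_size: int = 256    # maximum context length in tokens
    n_layer: int = 24        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # hidden (embedding) size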

Installation

First, install the required dependencies:

pip install torch sentencepiece huggingface-hub

Usage

Quick Start

Here's how to use the model for text generation:

import torch
import sentencepiece as sp
from huggingface_hub import hf_hub_download

# Step 1: Download and load the tokenizer
tokenizer_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="odia_tokenizer.model"
)

tokenizer = sp.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Step 2: Download model files
model_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="pytorch_model.bin"
)

config_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="config.json"
)

# Step 3: Load the model architecture and weights
# First, download the model architecture file
architecture_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="model_architecture.py"
)

# Import the model classes
import sys
import importlib.util
spec = importlib.util.spec_from_file_location("model_architecture", architecture_path)
model_module = importlib.util.module_from_spec(spec)
sys.modules["model_architecture"] = model_module
spec.loader.exec_module(model_module)

# Import the classes we need
GPTConfig = model_module.GPTConfig
GPT = model_module.GPT

# Create model configuration
config = GPTConfig()

# Initialize and load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT(config)

# Load the pretrained weights
checkpoint = torch.load(model_path, map_location=device)

# Check if the state_dict is nested and extract it if necessary
if isinstance(checkpoint, dict) and 'model' in checkpoint:
    state_dict = checkpoint['model']
else:
    state_dict = checkpoint

# Remove the 'model.' prefix from keys if present
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    if k.startswith('model.'):
        new_state_dict[k[6:]] = v  # Remove 'model.' prefix
    else:
        new_state_dict[k] = v

model.load_state_dict(new_state_dict)

model = model.to(device)
model.eval()
print(f"Model loaded successfully on {device}")

# Step 4: Generate text function
def generate_odia_text(prompt, max_length=100, temperature=0.8):
    # Encode the prompt into token IDs
    input_ids = tokenizer.encode_as_ids(prompt)
    input_tensor = torch.tensor(input_ids).unsqueeze(0).to(device)

    # Generate; whether max_length counts new tokens or the total
    # sequence depends on how generate() is defined in model_architecture.py
    with torch.no_grad():
        output = model.generate(input_tensor, max_length, temperature=temperature)

    # Decode the generated token IDs back to text
    generated_text = tokenizer.decode(output.squeeze().tolist())
    return generated_text

Example Usage

# Example 1: Simple text generation
prompt = "ସେ କାଲି ସ୍କୁଲକୁ"
generated_text = generate_odia_text(prompt, max_length=200)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")

# Example 2: Encode and decode text
text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
encoded = tokenizer.encode_as_ids(text)
print(f"Original: {text}")
print(f"Encoded: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")

Training Details

Training Hyperparameters

  • Max Iterations: 40,000
  • Learning Rate: 3e-4 with cosine decay
  • Batch Size: 16
  • Gradient Accumulation Steps: 8
  • Warmup Steps: 2,000
  • Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
  • Mixed Precision: bfloat16/float16
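
The exact training loop is not published with the model, but as an illustration, here is a hedged sketch of how this optimizer and schedule could be set up in PyTorch (names such as warmup_steps and lr_at are illustrative, not taken from the repository):

import math
import torch

max_iters = 40_000
warmup_steps = 2_000
peak_lr = 3e-4

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_at(step):
    # linear warmup to peak_lr, then cosine decay to zero
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_iters - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))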

Training Data

The model was trained on a combination of:

  1. OdiaGenAIdata/fine_web2_odia_pt - High-quality Odia web text
  2. bigscience-data/roots_indic-or_indic_nlp_corpus - Odia corpus from Indic NLP
  3. Custom curated Odia dataset - Additional hand-curated Odia texts

Total training samples: approximately 4 million texts
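
To inspect the public datasets yourself, something like the following should work with the Hugging Face datasets library (the split name is an assumption; check each dataset card, and note that some ROOTS datasets require accepting terms on the Hub):

from datasets import load_dataset

# The split name below is an assumption; verify on the dataset card
web_odia = load_dataset("OdiaGenAIdata/fine_web2_odia_pt", split="train")
print(web_odia[0])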

Limitations

  • Maximum context length is 256 tokens; longer prompts must be truncated to fit (see the sketch after this list)
  • Trained specifically on Odia text; the model may not perform well on other languages
  • May generate repetitive text for very long sequences
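
Because of the 256-token limit, long prompts should be truncated before generation. A minimal sketch that keeps the most recent tokens and reserves room for the continuation (truncate_prompt is an illustrative helper, not part of the repository):

def truncate_prompt(prompt, max_new_tokens=100, context_length=256):
    # Keep only the most recent tokens so that the prompt plus the
    # generated continuation fits inside the model's context window
    ids = tokenizer.encode_as_ids(prompt)
    budget = context_length - max_new_tokens
    return ids[-budget:]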

Intended Use

This model is intended for:

  • Odia text generation
  • Odia language research
  • Educational purposes
  • Building Odia language applications

Citation

If you use this model, please cite:

@misc{odia_gpt_2024,
  title={Odia GPT Language Model},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}

Contact

For questions and feedback, please open an issue on the model repository.
