metadata
language: or
license: apache-2.0
tags:
  - odia
  - language-model
  - text-generation
  - causal-lm
datasets:
  - OdiaGenAIdata/fine_web2_odia_pt
  - bigscience-data/roots_indic-or_indic_nlp_corpus
widget:
  - text: ଓଡିଆ ଭାଷା

Odia Language Model (odia_tokenizers_test)

Model Description

This is a GPT-style causal language model trained for Odia text generation. Given an Odia prompt, it generates a continuation that follows the context of the prompt.

Model Architecture

  • Vocabulary Size: 50,000 tokens
  • Context Length: 256 tokens
  • Number of Layers: 24
  • Number of Heads: 12
  • Hidden Size: 768
  • Parameters: ~354M
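The hyperparameters above allow a rough cross-check of the parameter count. The sketch below assumes a standard GPT-2-style block (fused QKV projection, 4x-wide MLP, biases, learned position embeddings), which this card does not confirm; `estimate_gpt_params` is a hypothetical helper, not part of the repository.

```python
def estimate_gpt_params(vocab_size, context_length, n_layers, hidden_size,
                        tied_embeddings=True):
    """Approximate parameter count for a GPT-2-style decoder (assumed layout)."""
    d = hidden_size
    # Per block: attention (QKV + output projection) and a 4x-wide MLP,
    # each with biases, plus two LayerNorms (scale and shift).
    attention = (3 * d * d + 3 * d) + (d * d + d)
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * 2 * d
    per_block = attention + mlp + layer_norms

    token_embedding = vocab_size * d
    position_embedding = context_length * d
    final_layer_norm = 2 * d
    # An untied output head adds a second vocab_size x d matrix.
    output_head = 0 if tied_embeddings else vocab_size * d

    return (n_layers * per_block + token_embedding + position_embedding
            + final_layer_norm + output_head)


# Values taken from the list above; the number of heads does not change the count.
tied = estimate_gpt_params(50_000, 256, 24, 768, tied_embeddings=True)
untied = estimate_gpt_params(50_000, 256, 24, 768, tied_embeddings=False)
print(f"tied embeddings:   {tied / 1e6:.1f}M")
print(f"untied embeddings: {untied / 1e6:.1f}M")
```

Under these assumptions the estimate comes to roughly 209M parameters with tied embeddings and 247M untied, which is lower than the ~354M listed above. The actual architecture may therefore differ from this layout; `config.json` in the repository is the authoritative source.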

Installation

First, install the required dependencies:

pip install torch sentencepiece huggingface-hub

Usage

Quick Start

The following code downloads the tokenizer and model files and outlines the text generation step:

import torch
import sentencepiece as sp
from huggingface_hub import hf_hub_download

# Step 1: Download and load the tokenizer
tokenizer_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="odia_tokenizer.model"
)

tokenizer = sp.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Step 2: Download model files
model_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="pytorch_model.bin"
)

config_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="config.json"
)

# Step 3: Load the model
# The checkpoint uses a custom GPT architecture, so the model class must be
# defined before the weights can be loaded. The architecture code is
# available in the repository.
model = None  # replace with an instance of the repository's GPT class

# Step 4: Generate text
def generate_odia_text(prompt, max_length=100):
    if model is None:
        raise RuntimeError("Load the model first (see Step 3).")

    # Encode the prompt and add a batch dimension
    input_ids = tokenizer.encode_as_ids(prompt)
    input_tensor = torch.tensor(input_ids).unsqueeze(0)

    # Generate; assumes the model class exposes generate(input_ids, max_length)
    with torch.no_grad():
        output = model.generate(input_tensor, max_length)

    # Decode the generated token IDs back to text
    return tokenizer.decode(output.squeeze().tolist())
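
What a generation step does with the model's output can be sketched as follows. This is a minimal illustration that assumes temperature scaling and top-k sampling; the repository's model may use a different decoding strategy. `sample_next_token` and its parameters are hypothetical names, not part of the repository.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=50, rng=random):
    """Sample one token ID from a list of logits using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Keep only the top_k highest-scoring token IDs.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax numerators over the kept logits, scaled by temperature.
    # Subtracting the maximum keeps exp() numerically stable.
    scaled = [logits[i] / temperature for i in top_ids]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy logits over a 4-token vocabulary (illustrative values only).
toy_logits = [0.1, 2.0, -1.0, 1.5]
greedy_like = sample_next_token(toy_logits, top_k=1)
sampled = sample_next_token(toy_logits, temperature=0.8, top_k=2, rng=random.Random(0))
```

With `top_k=1` the function always returns the highest-scoring token; larger `top_k` and higher `temperature` values give more varied output.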

Example Usage

# Example 1: Simple text generation (uncomment once the model is loaded)
prompt = "ବର୍ଷା"
# generated_text = generate_odia_text(prompt, max_length=200)
# print(generated_text)

# Example 2: Encode and decode text
text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
encoded = tokenizer.encode_as_ids(text)
print(f"Encoded: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")

Full Implementation Example

The full model architecture and implementation are available in the repository files. Refer to them for a complete working example.

Training Details

Training Hyperparameters

  • Max Iterations: 40,000
  • Learning Rate: 3e-4 with cosine decay
  • Batch Size: 16
  • Gradient Accumulation Steps: 8
  • Warmup Steps: 2,000
  • Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
  • Mixed Precision: bfloat16/float16
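
The schedule and batch settings above can be sketched in a few lines. This is an illustration under assumptions the card does not state: the warmup is taken to be linear, the floor `MIN_LR` is a hypothetical value, and "Max Iterations" is taken to count optimizer updates.

```python
import math

# Values from the list above.
MAX_ITERS = 40_000
WARMUP_STEPS = 2_000
PEAK_LR = 3e-4
BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 8
CONTEXT_LENGTH = 256

MIN_LR = 3e-5  # assumption: the card does not state a minimum learning rate


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at MAX_ITERS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_ITERS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_ITERS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


# Gradient accumulation multiplies the number of sequences per optimizer update.
sequences_per_update = BATCH_SIZE * GRAD_ACCUM_STEPS
# Upper bound, reached only if every sequence fills the context window.
tokens_per_update = sequences_per_update * CONTEXT_LENGTH

print(f"sequences per update: {sequences_per_update}")
print(f"tokens per update (max): {tokens_per_update}")
print(f"lr at step 21000: {learning_rate(21_000):.2e}")
```

With a batch size of 16 and 8 accumulation steps, each optimizer update sees 128 sequences, or at most 32,768 tokens at the 256-token context length.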

Training Data

The model was trained on a combination of:

  1. OdiaGenAIdata/fine_web2_odia_pt - High-quality Odia web text
  2. bigscience-data/roots_indic-or_indic_nlp_corpus - Odia corpus from Indic NLP

Total training samples: ~3.8M texts

Limitations

  • Maximum context length is 256 tokens
  • Trained specifically on Odia text; may not perform well on other languages
  • May generate repetitive text for very long sequences
  • The model requires the custom GPT architecture code to run
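
One way to work within the 256-token limit is to keep only the most recent tokens of a long prompt before generation. The sketch below is an illustration, not the repository's behavior; `crop_to_context` is a hypothetical helper that operates on a plain list of token IDs such as the output of `tokenizer.encode_as_ids`.

```python
CONTEXT_LENGTH = 256  # from the Model Architecture section


def crop_to_context(token_ids, max_new_tokens=0, context_length=CONTEXT_LENGTH):
    """Keep the most recent tokens so the prompt plus new tokens fit the window."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_length")
    # Negative slicing keeps the tail; shorter inputs are returned unchanged.
    return token_ids[-budget:]


# Stand-in for a long encoded prompt (300 token IDs).
long_prompt = list(range(300))
cropped = crop_to_context(long_prompt, max_new_tokens=56)
print(len(cropped))
```

Keeping the tail rather than the head preserves the text closest to where generation continues.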

Intended Use

This model is intended for:

  • Odia text generation
  • Odia language research
  • Educational purposes
  • Building Odia language applications

Citation

If you use this model, please cite:

@misc{odia_gpt_2024,
  title={Odia GPT Language Model},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}

Contact

For questions and feedback, please open an issue on the model repository.