language: or
license: apache-2.0
tags:
  - odia
  - language-model
  - text-generation
  - causal-lm
datasets:
  - OdiaGenAIdata/fine_web2_odia_pt
  - bigscience-data/roots_indic-or_indic_nlp_corpus
widget:
  - text: ଓଡିଆ ଭାଷା

Odia Language Model (odia_tokenizers_test)

Model Description

This is a GPT-style causal language model trained for Odia text generation. Given an Odia prompt, it generates a continuation of that text.

Model Architecture

Vocabulary Size: 50,000 tokens
Context Length: 256 tokens
Number of Layers: 24
Number of Heads: 12
Hidden Size: 768
Parameters: ~354M
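
As a rough check on the figures above, the parameter count of a GPT-2-style decoder can be estimated from the listed dimensions. The sketch below assumes a standard block layout (attention with four d×d projections, an MLP with 4x expansion, learned position embeddings, and an output layer tied to the token embedding). The card confirms none of these, and `estimate_gpt_params` is an illustrative name, not part of the repository. Biases and LayerNorm weights are ignored.

```python
def estimate_gpt_params(vocab_size, context_length, n_layers, hidden_size,
                        tied_output=True):
    """Approximate weight count for a GPT-2-style decoder (assumed layout)."""
    attention = 4 * hidden_size * hidden_size   # Q, K, V and output projections
    mlp = 8 * hidden_size * hidden_size         # two linear layers, 4x expansion
    blocks = n_layers * (attention + mlp)
    token_embedding = vocab_size * hidden_size
    position_embedding = context_length * hidden_size
    total = blocks + token_embedding + position_embedding
    if not tied_output:
        total += vocab_size * hidden_size       # separate output projection
    return total


# Dimensions as listed in the card.
listed = estimate_gpt_params(50_000, 256, 24, 768)
listed_untied = estimate_gpt_params(50_000, 256, 24, 768, tied_output=False)
# Same settings with a hidden size of 1,024, for comparison only.
wider = estimate_gpt_params(50_000, 256, 24, 1024)

print(f"{listed:,}")         # 208,465,920
print(f"{listed_untied:,}")  # 246,865,920
print(f"{wider:,}")          # 353,452,032
```

Under these assumptions the listed dimensions give roughly 208M parameters (about 247M with an untied output layer), while a hidden size of 1,024 gives roughly 353M. The stated ~354M therefore does not match a hidden size of 768 under this formula; `config.json` in the repository is the place to confirm the actual values.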

Installation

First, install the required dependencies:

pip install torch sentencepiece huggingface-hub

Usage

Quick Start

Here's how to use the model for text generation:

import torch
import sentencepiece as sp
from huggingface_hub import hf_hub_download

# Step 1: Download and load the tokenizer
tokenizer_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="odia_tokenizer.model"
)

tokenizer = sp.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Step 2: Download model files
model_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="pytorch_model.bin"
)

config_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="config.json"
)

# Step 3: Load the model
# The checkpoint uses a custom GPT architecture, so the model class must be
# defined (or imported from the repository files) before the weights in
# model_path can be loaded.

# Step 4: Generate text
def generate_odia_text(prompt, max_length=100):
    # Encode the prompt into a (1, seq_len) tensor of token IDs
    input_ids = tokenizer.encode_as_ids(prompt)
    input_tensor = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)

    # Generate (uncomment once the model is loaded)
    # output = model.generate(input_tensor, max_length)

    # Decode the output
    # generated_text = tokenizer.decode(output.squeeze().tolist())
    # return generated_text
    raise NotImplementedError("Load the custom GPT model before generating text")
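
The commented-out `model.generate` call above depends on the custom model class. As a minimal sketch of what such a generation step typically does, the loop below uses temperature sampling. It assumes the loaded model is a callable that maps token IDs of shape `(1, T)` to logits of shape `(1, T, vocab_size)`; the repository's class may behave differently. `sample_tokens` and its parameters are hypothetical names, not the repository's API.

```python
import torch


@torch.no_grad()
def sample_tokens(model, input_ids, max_new_tokens, context_length=256,
                  temperature=1.0):
    """Autoregressively extend input_ids (shape (1, T)) by max_new_tokens."""
    ids = input_ids
    for _ in range(max_new_tokens):
        # Feed only the most recent tokens that fit in the context window.
        window = ids[:, -context_length:]
        logits = model(window)  # assumed shape: (1, T, vocab_size)
        # Use the logits at the last position to pick the next token.
        next_logits = logits[:, -1, :] / temperature
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # shape (1, 1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```

The returned tensor contains the prompt followed by the sampled tokens and can be turned back into text with `tokenizer.decode(ids[0].tolist())`.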

Example Usage

# Example 1: Simple text generation
prompt = "ବର୍ଷା"
# generated_text = generate_odia_text(prompt, max_length=200)
# print(generated_text)

# Example 2: Encode and decode text
text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
encoded = tokenizer.encode_as_ids(text)
print(f"Encoded: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")

Full Implementation Example

The complete model architecture and implementation are available in the repository files; refer to them for a full working example.
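
Before wiring up the architecture code, it can help to inspect what the downloaded files contain. The following is a sketch only: it assumes `pytorch_model.bin` holds a plain state dict of tensors and that `config.json` is ordinary JSON, neither of which the card states explicitly. `describe_checkpoint` and `load_files` are illustrative helper names.

```python
import json


def describe_checkpoint(state_dict):
    """Return (name, shape) pairs for every tensor-like entry in a state dict."""
    return [
        (name, tuple(value.shape))
        for name, value in state_dict.items()
        if hasattr(value, "shape")
    ]


def load_files(model_path, config_path):
    """Load config.json and the raw checkpoint from the downloaded paths."""
    import torch  # imported here so describe_checkpoint works without torch

    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    # Assumption: the file is a plain state dict. weights_only=True restricts
    # unpickling to tensors and basic containers.
    state_dict = torch.load(model_path, map_location="cpu", weights_only=True)
    return config, state_dict
```

Calling `describe_checkpoint(state_dict)` on the loaded weights lists parameter names and shapes, which shows the layer count and hidden size that the model class has to match.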

Training Details

Training Hyperparameters

Max Iterations: 40,000
Learning Rate: 3e-4 with cosine decay
Batch Size: 16
Gradient Accumulation Steps: 8
Warmup Steps: 2,000
Optimizer: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
Mixed Precision: bfloat16/float16
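
The schedule implied by these values can be sketched as follows. The card does not give the warmup shape or the final learning rate, so linear warmup and a floor of one tenth of the peak rate (`MIN_LR`) are assumptions. The sketch also assumes that one iteration is one optimizer step and that every sequence fills the 256-token context.

```python
import math

# Values taken from the hyperparameter list above.
MAX_ITERS = 40_000
WARMUP_STEPS = 2_000
PEAK_LR = 3e-4
BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 8
CONTEXT_LENGTH = 256

# Assumption: the card does not state a final learning rate.
MIN_LR = PEAK_LR / 10


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at MAX_ITERS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_ITERS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_ITERS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


# Sequences and tokens consumed per optimizer step.
sequences_per_step = BATCH_SIZE * GRAD_ACCUM_STEPS        # 128
tokens_per_step = sequences_per_step * CONTEXT_LENGTH     # 32,768
```

With gradient accumulation, each optimizer step sees 16 × 8 = 128 sequences, or at most 32,768 tokens under the assumptions above.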

Training Data

The model was trained on a combination of:

OdiaGenAIdata/fine_web2_odia_pt - High-quality Odia web text
bigscience-data/roots_indic-or_indic_nlp_corpus - Odia corpus from Indic NLP

Total training samples: ~3.8M texts

Limitations

Maximum context length is 256 tokens
Trained specifically on Odia text, so it may not perform well on other languages
May generate repetitive text for very long sequences
The model requires the custom GPT architecture code to run
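
Because of the 256-token limit, a long prompt has to be shortened so that the prompt and the generated tokens fit in the window together. One possible way to do this, keeping the most recent tokens, is sketched below. `fit_prompt` is a hypothetical helper, not part of the repository.

```python
CONTEXT_LENGTH = 256  # maximum context length stated in the card


def fit_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens  # tokens left for the prompt
    return token_ids[-budget:]
```

For example, with `max_new_tokens=100` a 300-token prompt is cut to its last 156 tokens, while a prompt that already fits is returned unchanged.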

Intended Use

This model is intended for:

Odia text generation
Odia language research
Educational purposes
Building Odia language applications

Citation

If you use this model, please cite:

@misc{odia_gpt_2024,
  title={Odia GPT Language Model},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}

Contact

For questions and feedback, please open an issue on the model repository.