rupakrpk93
/

odia_tokenizers_test

@@ -5,15 +5,18 @@ tags:
 - odia
 - language-model
 - text-generation
 datasets:
 - OdiaGenAIdata/fine_web2_odia_pt
 - bigscience-data/roots_indic-or_indic_nlp_corpus
 ---
 # Odia Language Model (odia_tokenizers_test)
 ## Model Description
-This is a GPT-based language model specifically trained for Odia language text generation.
 ### Model Architecture
 - **Vocabulary Size**: 50,000 tokens
@@ -21,9 +24,139 @@ This is a GPT-based language model specifically trained for Odia language text g
 - **Number of Layers**: 24
 - **Number of Heads**: 12
 - **Hidden Size**: 768
 ## Training Details
 - **Max Iterations**: 40,000
-- **Learning Rate**: 3e-4
 - **Batch Size**: 16
-- **Optimizer**: AdamW

 - odia
 - language-model
 - text-generation
+- causal-lm
 datasets:
 - OdiaGenAIdata/fine_web2_odia_pt
 - bigscience-data/roots_indic-or_indic_nlp_corpus
+widget:
+- text: "ଓଡିଆ ଭାଷା"
 ---
 # Odia Language Model (odia_tokenizers_test)
 ## Model Description
+This is a GPT-based language model specifically trained for Odia language text generation. The model can generate coherent Odia text and continue prompts in a contextually appropriate manner.
 ### Model Architecture
 - **Vocabulary Size**: 50,000 tokens
 - **Number of Layers**: 24
 - **Number of Heads**: 12
 - **Hidden Size**: 768
+- **Parameters**: ~354M
+## Installation
+First, install the required dependencies:
+```bash
+pip install torch sentencepiece huggingface-hub
+```
+## Usage
+### Quick Start
+Here's how to use the model for text generation:
+```python
+import torch
+import sentencepiece as sp
+from huggingface_hub import hf_hub_download
+import numpy as np
+# Step 1: Download and load the tokenizer
+tokenizer_path = hf_hub_download(
+    repo_id="rupakrpk93/odia_tokenizers_test",
+    filename="odia_tokenizer.model"
+)
+tokenizer = sp.SentencePieceProcessor()
+tokenizer.load(tokenizer_path)
+# Step 2: Download model files
+model_path = hf_hub_download(
+    repo_id="rupakrpk93/odia_tokenizers_test",
+    filename="pytorch_model.bin"
+)
+config_path = hf_hub_download(
+    repo_id="rupakrpk93/odia_tokenizers_test",
+    filename="config.json"
+)
+# Step 3: Load the model (you need the model class definition)
+# Note: You'll need to define the GPT model architecture
+# The model architecture code is available in the repository
+# Step 4: Generate text
+def generate_odia_text(prompt, max_length=100):
+    # Encode the prompt
+    input_ids = tokenizer.encode_as_ids(prompt)
+    input_tensor = torch.tensor(input_ids).unsqueeze(0)
+    # Generate (assuming model is loaded)
+    # output = model.generate(input_tensor, max_length)
+    # Decode the output
+    # generated_text = tokenizer.decode(output.squeeze().tolist())
+    # return generated_text
+    pass
+```
+### Example Usage
+```python
+# Example 1: Simple text generation
+prompt = "ବର୍ଷା"
+# generated_text = generate_odia_text(prompt, max_length=200)
+# print(generated_text)
+# Example 2: Encode and decode text
+text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
+encoded = tokenizer.encode_as_ids(text)
+print(f"Encoded: {encoded}")
+decoded = tokenizer.decode(encoded)
+print(f"Decoded: {decoded}")
+```
+### Full Implementation Example
+For a complete working example with the model architecture:
+```python
+# The full model architecture and implementation
+# is available in the repository files.
+# Please refer to the model implementation for complete code.
+```
 ## Training Details
+### Training Hyperparameters
 - **Max Iterations**: 40,000
+- **Learning Rate**: 3e-4 with cosine decay
 - **Batch Size**: 16
+- **Gradient Accumulation Steps**: 8
+- **Warmup Steps**: 2,000
+- **Optimizer**: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
+- **Mixed Precision**: bfloat16/float16
+### Training Data
+The model was trained on a combination of:
+1. **OdiaGenAIdata/fine_web2_odia_pt** - High-quality Odia web text
+2. **bigscience-data/roots_indic-or_indic_nlp_corpus** - Odia corpus from Indic NLP
+Total training samples: ~3.8M texts
+## Limitations
+- Maximum context length is 256 tokens
+- Trained specifically on Odia text, may not perform well on other languages
+- May generate repetitive text for very long sequences
+- The model requires the custom GPT architecture code to run
+## Intended Use
+This model is intended for:
+- Odia text generation
+- Odia language research
+- Educational purposes
+- Building Odia language applications
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{odia_gpt_2024,
+  title={Odia GPT Language Model},
+  author={Your Name},
+  year={2024},
+  publisher={HuggingFace}
+}
+```
+## Contact
+For questions and feedback, please open an issue on the model repository.