rupakrpk93 committed
Commit ee7b978 · verified · Parent(s): f178917

Upload README.md with huggingface_hub

Files changed (1): README.md (+136 -3)

README.md (updated):
tags:
- odia
- language-model
- text-generation
- causal-lm
datasets:
- OdiaGenAIdata/fine_web2_odia_pt
- bigscience-data/roots_indic-or_indic_nlp_corpus
widget:
- text: "ଓଡିଆ ଭାଷା"
---

# Odia Language Model (odia_tokenizers_test)

## Model Description
This is a GPT-based language model specifically trained for Odia language text generation. The model can generate coherent Odia text and continue prompts in a contextually appropriate manner.

### Model Architecture
- **Vocabulary Size**: 50,000 tokens
- **Number of Layers**: 24
- **Number of Heads**: 12
- **Hidden Size**: 768
- **Parameters**: ~354M
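
For orientation, these numbers map directly onto a standard GPT-style configuration object. The sketch below is for illustration only: the field names are hypothetical, and the authoritative values live in the repository's config.json.

```python
from dataclasses import dataclass

# Hypothetical config mirroring the list above; field names are assumptions,
# not taken from the repository's config.json.
@dataclass
class GPTConfig:
    vocab_size: int = 50_000  # SentencePiece vocabulary size
    block_size: int = 256     # max context length (see Limitations)
    n_layer: int = 24         # transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # hidden size
```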

## Installation

First, install the required dependencies:

```bash
pip install torch sentencepiece huggingface-hub
```

## Usage

### Quick Start

Here's how to use the model for text generation:

```python
import torch
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Step 1: Download and load the SentencePiece tokenizer
tokenizer_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="odia_tokenizer.model"
)
tokenizer = spm.SentencePieceProcessor()
tokenizer.load(tokenizer_path)

# Step 2: Download the model weights and config
model_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="pytorch_model.bin"
)
config_path = hf_hub_download(
    repo_id="rupakrpk93/odia_tokenizers_test",
    filename="config.json"
)

# Step 3: Instantiate the model and load the weights.
# The GPT architecture class is not bundled with this README; see the
# repository files (and the sketch under "Full Implementation Example").
# model = GPT(config)
# model.load_state_dict(torch.load(model_path, map_location="cpu"))
# model.eval()

# Step 4: Generate text with simple temperature sampling.
# Assumes `model(idx)` returns logits of shape (batch, seq_len, vocab_size).
@torch.no_grad()
def generate_odia_text(model, prompt, max_new_tokens=100, temperature=0.8):
    # Encode the prompt into token ids
    input_ids = tokenizer.encode_as_ids(prompt)
    idx = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -256:]            # crop to the 256-token context window
        logits = model(idx_cond)[:, -1, :]  # logits for the last position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return tokenizer.decode(idx.squeeze(0).tolist())
```

### Example Usage

```python
# Example 1: Simple text generation (requires the loaded model from above)
prompt = "ବର୍ଷା"
# generated_text = generate_odia_text(model, prompt, max_new_tokens=200)
# print(generated_text)

# Example 2: Encode and decode text with the tokenizer alone
text = "ଓଡିଆ ଭାଷା ଏକ ସୁନ୍ଦର ଭାଷା"
encoded = tokenizer.encode_as_ids(text)
print(f"Encoded: {encoded}")

decoded = tokenizer.decode(encoded)
print(f"Decoded: {decoded}")
```

### Full Implementation Example

The full model architecture and implementation are available in the repository files; refer to them for the exact code. For orientation, a generic sketch follows.
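
The sketch below is one way to fill that gap: a minimal, generic GPT in the nanoGPT style, sized to the hyperparameters above and reusing the hypothetical GPTConfig from the Model Architecture section. It is an illustration under assumptions, not the repository's actual module; the checkpoint will only load if the state-dict keys match the original training code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask."""

    def __init__(self, cfg):
        super().__init__()
        self.n_head = cfg.n_head
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd)
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd)
        mask = torch.tril(torch.ones(cfg.block_size, cfg.block_size))
        self.register_buffer("mask", mask.view(1, 1, cfg.block_size, cfg.block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    """Pre-norm transformer block: attention followed by a 4x MLP."""

    def __init__(self, cfg):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.n_embd)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, 4 * cfg.n_embd),
            nn.GELU(),
            nn.Linear(4 * cfg.n_embd, cfg.n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))

class GPT(nn.Module):
    """Token + learned positional embeddings, n_layer blocks, LM head."""

    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)
        self.blocks = nn.ModuleList([Block(cfg) for _ in range(cfg.n_layer)])
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # (B, T, vocab_size) logits
```

With a class like this defined, Step 3 of the Quick Start becomes `model = GPT(GPTConfig())` followed by `model.load_state_dict(torch.load(model_path, map_location="cpu"))`, again assuming the key names line up with the original training code.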

## Training Details

### Training Hyperparameters
- **Max Iterations**: 40,000
- **Learning Rate**: 3e-4 with cosine decay (see the schedule sketch below)
- **Batch Size**: 16
- **Gradient Accumulation Steps**: 8
- **Warmup Steps**: 2,000
- **Optimizer**: AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- **Mixed Precision**: bfloat16/float16
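
With a batch size of 16 and 8 gradient-accumulation steps, the effective batch is 128 sequences per optimizer update. The warmup and decay settings translate into a standard schedule; a minimal sketch, where the decay floor min_lr is an assumption (the card only states the peak rate):

```python
import math

max_lr = 3e-4        # peak learning rate (from the list above)
min_lr = 3e-5        # decay floor -- an assumption, not stated in this card
warmup_steps = 2_000
max_steps = 40_000

def lr_at(step: int) -> float:
    # Linear warmup over the first 2,000 steps...
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # ...then cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```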

### Training Data
The model was trained on a combination of:
1. **OdiaGenAIdata/fine_web2_odia_pt** - high-quality Odia web text
2. **bigscience-data/roots_indic-or_indic_nlp_corpus** - Odia corpus from Indic NLP

Total training samples: ~3.8M texts
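
To assemble a similar mixture, both corpora can be pulled from the Hub with the `datasets` library. A sketch, assuming both load with a `train` split and expose a plain `text` column (split and column names are assumptions, and the ROOTS subset may require accepting its terms of use):

```python
from datasets import load_dataset, concatenate_datasets

# Pull both pre-training corpora from the Hugging Face Hub.
fine_web = load_dataset("OdiaGenAIdata/fine_web2_odia_pt", split="train")
roots = load_dataset("bigscience-data/roots_indic-or_indic_nlp_corpus", split="train")

# Keep only the raw text column before mixing (column name is an assumption).
fine_web = fine_web.select_columns(["text"])
roots = roots.select_columns(["text"])

corpus = concatenate_datasets([fine_web, roots]).shuffle(seed=42)
print(corpus)  # one combined, shuffled text corpus
```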

## Limitations

- Maximum context length is 256 tokens
- Trained specifically on Odia text; may not perform well on other languages
- May generate repetitive text for very long sequences (a simple mitigation is sketched after this list)
- The model requires the custom GPT architecture code to run
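
One common mitigation for the repetition issue is a CTRL-style repetition penalty applied to the logits before sampling. A sketch that slots into the `generate_odia_text` loop above (the penalty value 1.2 is a conventional default, not tuned for this model):

```python
def apply_repetition_penalty(logits, idx, penalty=1.2):
    # Down-weight every token id already present in the generated sequence:
    # positive logits are divided by the penalty, negative ones multiplied.
    for token_id in set(idx[0].tolist()):
        score = logits[0, token_id]
        logits[0, token_id] = score / penalty if score > 0 else score * penalty
    return logits
```

In the generation loop, call `logits = apply_repetition_penalty(logits, idx)` right before the softmax.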

## Intended Use

This model is intended for:
- Odia text generation
- Odia language research
- Educational purposes
- Building Odia language applications

## Citation

If you use this model, please cite:
```bibtex
@misc{odia_gpt_2024,
  title={Odia GPT Language Model},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}
```

## Contact

For questions and feedback, please open an issue on the model repository.