kundeshwar20 committed on
Commit
0b80084
·
verified ·
1 Parent(s): b069347

Upload README.md

Files changed (1)
  1. README.md +185 -0
README.md ADDED
@@ -0,0 +1,185 @@
<div align="center">
  <img src="https://huggingface.co/bharatgenai/Param-1-2.9B-Instruct/resolve/main/BharatGen%20Logo%20(1).png" width="60%" alt="BharatGen" />
</div>
<hr>
<div align="center">
  <a href="#" style="margin: 4px; pointer-events: none; cursor: default;">
    <img alt="Paper" src="https://img.shields.io/badge/Paper-Coming%20Soon-lightgrey?style=flat" />
  </a>
  <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" style="margin: 4px;">
    <img alt="License" src="https://img.shields.io/badge/License-CC--BY--4.0-blue.svg" />
  </a>
  <a href="#" target="_blank" style="margin: 4px;">
    <img alt="Blog" src="https://img.shields.io/badge/Blog-Read%20More-brightgreen?style=flat" />
  </a>
</div>

# AyurParam
BharatGen introduces AyurParam, a domain-specialized large language model fine-tuned from Param-1-2.9B-Instruct on a high-quality Ayurveda dataset. It is designed to handle Ayurvedic queries, classical text interpretation, clinical guidance, and wellness knowledge. Ayurveda offers vast traditional medical wisdom, yet most language models lack domain-specific understanding. AyurParam bridges this gap by combining Param-1’s bilingual strengths with a curated Ayurvedic knowledge base, enabling contextually rich and culturally grounded responses.

## 🏗 Model Architecture
AyurParam inherits the architecture of Param-1-2.9B-Instruct:
* Hidden size: 2048
* Intermediate size: 7168
* Attention heads: 16
* Hidden layers: 32
* Key-value heads: 8
* Max position embeddings: 2048
* Activation: SiLU
* Positional embeddings: Rotary (RoPE, theta=10000)
* Attention mechanism: Grouped-query attention
* Precision: bf16-mixed
* Base model: Param-1-2.9B-Instruct

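These hyperparameters can be read back from the published checkpoint's configuration. The sketch below assumes the config exposes standard Hugging Face attribute names, which may differ for Param-1's custom architecture class.

```python
from transformers import AutoConfig

# Fetch only the configuration (no weights). trust_remote_code=True because
# Param-1/AyurParam ships a custom model implementation.
config = AutoConfig.from_pretrained("bharatgenai/AyurParam", trust_remote_code=True)

# Attribute names below follow common Hugging Face conventions and are an
# assumption; fall back to "n/a" if this config class names them differently.
for field in ("hidden_size", "intermediate_size", "num_attention_heads",
              "num_hidden_layers", "num_key_value_heads",
              "max_position_embeddings", "rope_theta", "vocab_size"):
    print(field, getattr(config, field, "n/a"))
```
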
## 📚 Data Preparation
AyurParam’s training corpus was carefully crafted to ensure deep Ayurvedic knowledge, Sanskrit/English bilingual accessibility, and clinical relevance.
Steps involved (a hypothetical single-record sketch follows this list):
1. Source Gathering
   * 15k+ passages from classical Ayurvedic texts (digitized and curated).
   * 10k+ passages from AYUSH ministry guidelines, research papers, and clinical case discussions.
2. Question Generation
   * 5 curated Q&A pairs generated per passage using an open-source LLM + domain expert review.
3. Domain Taxonomy & Personas
   * Built an Ayurveda-specific taxonomy (Dosha, Dhatu, Mala, Srotas, Nidana, Chikitsa, etc.).
   * Defined multiple personas: student, vaidya (physician), researcher, policymaker, wellness coach.
4. Dataset Construction
   * 1.5M Q&A pairs grounded in taxonomy and personas.
   * 4M multi-turn conversation samples created.
   * Sanskrit terminology preserved with transliteration and explanations.

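A single supervised sample produced by this pipeline might look roughly as follows. The field names and values are purely illustrative assumptions, not the released dataset schema.

```python
# Hypothetical record layout for one generated Q&A sample (illustrative only;
# the actual AyurParam dataset schema is not published in this card).
sample = {
    "source": "classical_text",          # or "ayush_guideline", "research_paper"
    "taxonomy": ["Nidana", "Chikitsa"],  # Ayurveda-specific taxonomy tags
    "persona": "vaidya",                 # student / vaidya / researcher / ...
    "question": "What is the Samprapti of Amavata?",
    "answer": "Amavata develops when Ama combines with aggravated Vata ...",
    "language": "en",                    # English or Hindi
}
```
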

## 🏋️ Training Setup
* Base model: Param-1-2.9B-Instruct
* Training framework: Hugging Face + TRL (SFT) + torchrun multi-node setup
* Prompt template: Custom-designed for Ayurvedic inference
* Scheduler: Linear with warmup
* Epochs: 3
* Total training samples: ~8M
* Test samples: ~800k
* Base learning rate: 5e-6
* Minimum learning rate: 0
* Additional tokens: `<user>`, `<assistant>`, `<context>`, `<system_prompt>`
* Vocab size: 256k + 4
* Global batch size: 1024
* Micro batch size: 4
* Gradient accumulation steps: 32

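The listed numbers are mutually consistent: a micro batch of 4 with 32 gradient-accumulation steps gives 128 samples per optimizer step per device, so a global batch of 1024 implies 8 data-parallel ranks. Below is a minimal, hedged sketch of what the TRL supervised fine-tuning setup could look like; the dataset, warmup fraction, and sequence handling are illustrative assumptions, and exact argument names vary across TRL versions.

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base = "bharatgenai/Param-1-2.9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# The card lists 4 additional special tokens (vocab 256k + 4); illustrative here.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<user>", "<assistant>", "<context>", "<system_prompt>"]}
)
model.resize_token_embeddings(len(tokenizer))

# Placeholder dataset: each row carries one already-templated training string.
train_dataset = Dataset.from_list([
    {"text": "<user> What is Amavata? <assistant> Amavata is a condition where ..."},
])

args = SFTConfig(
    output_dir="ayurparam-sft",
    num_train_epochs=3,
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.01,                # warmup fraction not stated in the card
    per_device_train_batch_size=4,    # micro batch size
    gradient_accumulation_steps=32,   # 4 x 32 x 8 ranks = 1024 global batch
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,       # `tokenizer=` in older TRL releases
)
trainer.train()                        # launch via torchrun for multi-node runs
```
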

## 🚀 Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bharatgenai/AyurParam"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

# Example Ayurvedic query
user_input = "What is the Samprapti (pathogenesis) of Amavata according to Ayurveda?"

# Prompt styles
# 1. Generic QA: <user> ... <assistant>
# 2. Context-based QA: <context> ... <user> ... <assistant>
# 3. Multi-turn conversation (supports up to 5 turns): <user> ... <assistant> ... <user> ... <assistant>

prompt = f"<user> {user_input} <assistant>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
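
The other two prompt styles noted in the comments reuse the same special tokens. A sketch of how those strings might be assembled follows; the exact spacing and ordering are assumptions extrapolated from the generic template above.

```python
# Reuses `user_input` from the example above.

# 2. Context-based QA: ground the question in a retrieved or quoted passage.
context = "Passage from a classical text or clinical guideline used as grounding ..."
prompt_context = f"<context> {context} <user> {user_input} <assistant>"

# 3. Multi-turn conversation (up to 5 turns): alternate <user>/<assistant> tags
#    and leave the final <assistant> open for the model to complete.
history = [
    ("What is Amavata?", "Amavata is a joint disorder linked to Ama and aggravated Vata ..."),
]
prompt_chat = "".join(f"<user> {q} <assistant> {a} " for q, a in history)
prompt_chat += f"<user> {user_input} <assistant>"
```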


## 📊 Benchmark Results: AyurParam vs Baselines
- Benchmark: [BhashaBench-Ayur](https://huggingface.co/datasets/bharatgenai/BhashaBench-Ayur) (code: [BhashaBench on GitHub](https://github.com/BharatGen-IITB-TIH/BhashaBench))
- In the tables below, `bba` is the overall BhashaBench-Ayur score; `bba_English` and `bba_Hindi` are the English and Hindi subsets.
---

## 1. Overall Performance

### Similar Range Models
| Model | bba | bba_English | bba_Hindi |
|-----------------------|-------|-------------|-----------|
| Llama-3.2-1B-Instruct | 26.41 | 26.77 | 25.82 |
| Qwen2.5-3B-Instruct | 32.68 | 35.22 | 28.46 |
| granite-3.1-2b | 31.10 | 33.39 | 27.30 |
| Llama-3.2-3B-Instruct | 33.20 | 35.31 | 29.67 |
| gemma-2-2b-it | 28.40 | 29.38 | 26.79 |
| **AyurParam** | **39.97** | **41.12** | **38.04** |

### Larger Models
| Model | bba | bba_English | bba_Hindi |
|-----------------------------------------|-------|-------------|-----------|
| Indic-gemma-7B-Navarasa-2.0 | 35.13 | 37.12 | 31.83 |
| Pangea-7B | 37.41 | 40.69 | 31.93 |
| aya-23-8B | 31.97 | 33.84 | 28.87 |
| gpt-oss-20b | 36.34 | 38.30 | 33.09 |
| Llama-3.1-8B-Instruct | 34.76 | 36.86 | 31.26 |
| gemma-2-27b-it | 37.99 | 40.45 | 33.89 |
| Nemotron-4-Mini-Hindi-4B-Instruct | 33.54 | 33.38 | 33.82 |
| **AyurParam** | **39.97** | **41.12** | **38.04** |

---

## 2. Question Difficulty

### Similar Range Models
| Difficulty | Llama-3.2-1B | Qwen2.5-3B | granite-3.1-2b | Llama-3.2-3B | gemma-2-2b-it | **AyurParam** |
|------------|--------------|------------|----------------|--------------|---------------|----------------|
| **Easy** | 27.44 | 35.55 | 33.90 | 36.42 | 29.96 | **43.93** |
| **Medium** | 25.23 | 29.57 | 28.06 | 29.66 | 26.83 | **35.95** |
| **Hard** | 25.39 | 28.23 | 26.81 | 28.51 | 24.96 | **31.21** |

### Larger Models
| Difficulty | Indic-gemma-7B | Pangea-7B | aya-23-8B | gpt-oss-20b | Llama-3.1-8B | gemma-2-27b-it | Nemotron-4-Mini-Hindi-4B | **AyurParam** |
|------------|----------------|-----------|-----------|-------------|--------------|----------------|--------------------------|----------------|
| **Easy** | 38.54 | 41.45 | 35.51 | 42.03 | 39.43 | 43.47 | 36.08 | **43.93** |
| **Medium** | 31.72 | 32.94 | 28.29 | 30.27 | 29.36 | 31.90 | 30.80 | **35.95** |
| **Hard** | 27.23 | 31.77 | 25.11 | 26.67 | 30.50 | 30.78 | 29.50 | **31.21** |

---

## 3. Question Type

### Similar Range Models
| Type | Llama-3.2-1B | Qwen2.5-3B | granite-3.1-2b | Llama-3.2-3B | gemma-2-2b-it | **AyurParam** |
|----------------------|--------------|------------|----------------|--------------|---------------|----------------|
| Assertion/Reasoning | 59.26 | 51.85 | 33.33 | 40.74 | 33.33 | **44.44** |
| Fill in the blanks | 26.97 | 29.21 | 21.35 | 34.83 | 32.02 | **29.78** |
| MCQ | 26.34 | 32.70 | 31.22 | 33.17 | 28.33 | **40.12** |
| Match the column | 26.83 | 29.27 | 29.27 | 29.27 | 36.59 | **24.39** |

### Larger Models
| Type | Indic-gemma-7B | Pangea-7B | aya-23-8B | gpt-oss-20b | Llama-3.1-8B | gemma-2-27b-it | Nemotron-4-Mini-Hindi-4B | **AyurParam** |
|----------------------|----------------|-----------|-----------|-------------|--------------|----------------|--------------------------|----------------|
| Assertion/Reasoning | 59.26 | 62.96 | 18.52 | 25.93 | 29.63 | 55.56 | 37.04 | **44.44** |
| Fill in the blanks | 35.39 | 24.16 | 30.90 | 32.02 | 26.97 | 35.96 | 30.34 | **29.78** |
| MCQ | 35.10 | 37.53 | 32.05 | 36.39 | 34.83 | 37.98 | 33.60 | **40.12** |
| Match the column | 31.71 | 34.15 | 17.07 | 46.34 | 46.34 | 39.02 | 24.39 | **24.39** |

---
From the above results, **AyurParam not only outperforms all similar-sized models** but also achieves **competitive or better performance than larger models** across multiple metrics.

## License
This model is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.

## Contact
For any questions or feedback, please contact:
- Sravan Kumar ([email protected])
- Kundeshwar Pundalik ([email protected])