CultriX committed · Commit fabc341 (verified) · 1 Parent(s): e7fc387

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ imatrix.dat filter=lfs diff=lfs merge=lfs -text
37
+ qwen3-8b-hippocratesv1.fp16.gguf filter=lfs diff=lfs merge=lfs -text
38
+ qwen3-8b-hippocratesv1.q2_k-imat-00001-of-00002.gguf filter=lfs diff=lfs merge=lfs -text
39
+ qwen3-8b-hippocratesv1.q2_k-imat-00002-of-00002.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,227 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - FreedomIntelligence/medical-o1-reasoning-SFT
5
+ - unsloth/OpenMathReasoning-mini
6
+ - mlabonne/guanaco-llama2-1k
7
+ - madrylab/gsm8k-platinum
8
+ language:
9
+ - en
10
+ - es
11
+ base_model:
12
+ - Qwen/Qwen3-8B
13
+ pipeline_tag: text-generation
14
+ library_name: transformers
15
+ tags:
16
+ - medical
17
+ - grpo
18
+ - math
19
+ - reasoning
20
+ - fine-tuned
21
+ - qlora
22
+ - lora
23
+ - multi-stage-finetuning
24
+ - autoquant
25
+ - gguf
26
+ ---
27
+
28
+ # Qwen3-8B-MultiStage-Finetune-Hybrid
29
+
30
+ ## Model Description
31
+
32
+ This is a **fine-tuned** version of the **Qwen/Qwen3-8B** large language model, specialized through a multi-stage training pipeline focused on **medical reasoning**, **mathematical problem-solving**, and **general conversational abilities**. The model was trained with **QLoRA** (Quantized Low-Rank Adaptation) for memory-efficient fine-tuning and **GRPO** (Group Relative Policy Optimization) for reinforcement learning on its mathematical outputs.
33
+
34
+ The training methodology uses a progressive approach, building capabilities in distinct areas before consolidating them:
35
+
36
+ 1. **Medical Reasoning SFT:** Initial fine-tuning on a specialized medical dataset to adapt the model to medical explanations and reasoning.
37
+ 2. **Mathematical SFT:** Further fine-tuning on a mathematical dataset to enhance its ability to solve math problems.
38
+ 3. **Mathematical GRPO:** A reinforcement learning stage that leverages a reward function to optimize the model's accuracy and ability to provide structured mathematical solutions, particularly with answers in a `\boxed{}` format.
39
+ 4. **General Chat SFT:** Final fine-tuning on a diverse chat dataset to improve conversational fluency, helpfulness, and alignment with common dialogue patterns.
40
+
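The reward used in the GRPO stage is not published in this card. As a minimal sketch of what a `\boxed{}`-aware reward could look like, the snippet below extracts the last boxed answer from a completion and scores it against a reference. The function names and the reward values (`1.0`, `0.1`, `0.0`) are hypothetical choices for illustration, not the values used in training.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never balanced


def boxed_reward(completion, reference):
    """Hypothetical reward: full credit for a matching boxed answer,
    small credit for a well-formed but wrong one, none otherwise."""
    answer = extract_boxed(completion)
    if answer is None:
        return 0.0
    if answer == reference.strip():
        return 1.0
    return 0.1


if __name__ == "__main__":
    print(boxed_reward("200 + 90 = 290, so \\boxed{290}", "290"))
```

An exact string match is the simplest possible comparison; a real reward would likely normalize numbers and units before comparing.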
41
+ ## Training Details
42
+
43
+ ### Training Data
44
+
45
+ The model was trained on a carefully selected set of public datasets:
46
+
47
+ * **Medical Reasoning:** `FreedomIntelligence/medical-o1-reasoning-SFT`
48
+ * **Mathematical Reasoning:** `unsloth/OpenMathReasoning-mini`
49
+ * **General Conversation:** `mlabonne/guanaco-llama2-1k`
50
+
51
+ ### Training Procedure
52
+
53
+ The model was fine-tuned using a hybrid approach that combines the efficient training capabilities of the `unsloth` library with the advanced reinforcement learning features of `trl`:
54
+
55
+ * **Base Model:** Qwen/Qwen3-8B
56
+ * **Quantization:** 4-bit NormalFloat (NF4) with double quantization enabled (`bnb_4bit_use_double_quant=True`), allowing for efficient training on limited GPU memory.
57
+ * **LoRA Configuration:** A rank of `r=24`, `lora_alpha=32`, and `lora_dropout=0.05` was applied. Key attention and feed-forward projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) were targeted for adaptation.
58
+ * **Gradient Checkpointing:** Enabled for memory efficiency, with `recompute_grad=True` for Unsloth-specific optimizations.
59
+ * **Dynamic Hyperparameters:** Batch sizes and gradient accumulation steps were adaptively adjusted per training stage to optimize GPU memory utilization and training throughput.
60
+ * **Learning Rate Schedule:** A cosine decay schedule with a warmup ratio was used. Learning rates were customized for each training stage.
61
+ * **Optimizer:** `adamw_8bit`.
62
+ * **Regularization:** Gradient norm clipping (`max_grad_norm=1.0`) and weight decay (`0.01`) were applied to prevent exploding gradients and overfitting.
63
+ * **Early Stopping:** Applied during SFT stages with a patience of 2 on validation loss, stopping training if no significant improvement was observed.
64
+ * **Hardware:** Training was performed on a single GPU.
65
+ * **Software Stack:** Python, Hugging Face `transformers`, `unsloth`, `trl`, `datasets`, `wandb` (for experiment tracking), and `vllm` (used during the GRPO stage for efficient text generation).
66
+
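To give a sense of scale for the LoRA configuration above, the following sketch counts the adapter parameters that `r=24` adds across the seven targeted projections. The layer dimensions are assumptions taken from the published Qwen3-8B configuration (hidden size 4096, intermediate size 12288, 36 layers, 32 query heads, 8 key/value heads, head dimension 128) and should be checked against the actual `config.json`.

```python
# Assumed Qwen3-8B dimensions; verify against the model's config.json.
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
Q_DIM = 32 * 128   # query heads * head dimension
KV_DIM = 8 * 128   # key/value heads * head dimension

# LoRA settings stated in this model card.
LORA_R = 24
LORA_ALPHA = 32

# (in_features, out_features) for each targeted projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A LoRA adapter adds A (r x in) and B (out x r) to one linear layer."""
    return r * (in_features + out_features)


def total_lora_params(r=LORA_R):
    per_layer = sum(lora_params(i, o, r) for i, o in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


# The adapter update is scaled by alpha / r before being added to the base weight.
LORA_SCALING = LORA_ALPHA / LORA_R

if __name__ == "__main__":
    print(f"trainable LoRA parameters: {total_lora_params():,}")
    print(f"LoRA scaling factor: {LORA_SCALING:.3f}")
```

Under these assumptions the adapters come to roughly 65.5 million trainable parameters, a small fraction of the 8B base model, with a scaling factor of `32 / 24 ≈ 1.333`.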
67
+ ## Usage
68
+
69
+ This model is designed for **text generation**, particularly in response to chat-style prompts and medical or mathematical queries. For best results, format prompts with the tokenizer's chat template so they match the structure used during training.
70
+
71
+ ### Load the Model
72
+
73
+ ```python
74
+ import torch
75
+ from transformers import AutoTokenizer, BitsAndBytesConfig
76
+ from unsloth import FastLanguageModel
77
+
78
+ # Configuration parameters (matching training)
79
+ MAX_SEQ_LENGTH = 2048
80
+ LOAD_IN_4BIT = True
81
+ USE_DOUBLE_QUANT = True
82
+
83
+ # Initialize BitsAndBytesConfig as used during training
84
+ bnb_config = BitsAndBytesConfig(
85
+     load_in_4bit=LOAD_IN_4BIT,
86
+     bnb_4bit_quant_type="nf4",
87
+     bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
88
+     bnb_4bit_use_double_quant=USE_DOUBLE_QUANT,
89
+ )
90
+
91
+ # Replace with the actual path to your uploaded model on Hugging Face Hub
92
+ model_id = "your-huggingface-username/Qwen3-8B-MultiStage-Finetune-Hybrid"
93
+
94
+ # Load tokenizer
95
+ tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
96
+ if tokenizer.pad_token is None:
97
+     tokenizer.pad_token = tokenizer.eos_token  # Ensure pad token is set for generation
98
+
99
+ # Load the model using Unsloth's optimized loading; from_pretrained returns a (model, tokenizer) pair
100
+ model, _ = FastLanguageModel.from_pretrained(
101
+     model_id,
102
+     quantization_config=bnb_config,
103
+     max_seq_length=MAX_SEQ_LENGTH,
104
+     device_map="auto",  # Automatically maps model to available GPUs
105
+ )
106
+
107
+ # Example for a general chat interaction
108
+ messages = [
109
+     {"role": "system", "content": "You are a friendly and helpful assistant."},
110
+     {"role": "user", "content": "Tell me a short, funny story about a clumsy robot."},
111
+ ]
112
+
113
+ # Apply the chat template and tokenize inputs
114
+ input_ids = tokenizer.apply_chat_template(
115
+     messages,
116
+     tokenize=True,
117
+     add_generation_prompt=True,  # Important: Add the prompt for the assistant's turn
118
+     return_tensors="pt"
119
+ ).to("cuda")  # Move inputs to GPU
120
+
121
+ # Generate outputs
122
+ outputs = model.generate(
123
+     input_ids,
124
+     max_new_tokens=512,  # Maximum tokens to generate
125
+     do_sample=True,  # Enable sampling for more diverse outputs
126
+     temperature=0.7,  # Control randomness
127
+     top_p=0.95  # Nucleus sampling
128
+ )
129
+
130
+ # Decode and print the generated text, skipping special tokens
131
+ print("--- General Chat Example ---")
132
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
133
+
134
+ # Example for a math problem (model is trained to provide a \boxed{} answer)
135
+ math_messages = [
136
+     {"role": "system", "content": "You are a math solver. Provide your reasoning within \\ and the final answer in \\boxed{} format."},
137
+     {"role": "user", "content": "If a car travels at 80 km/h for 2.5 hours, and then at 60 km/h for another 1.5 hours, what is the total distance traveled?"},
138
+ ]
139
+
140
+ # Apply math chat template and tokenize inputs
141
+ math_input_ids = tokenizer.apply_chat_template(
142
+     math_messages,
143
+     tokenize=True,
144
+     add_generation_prompt=True,
145
+     return_tensors="pt"
146
+ ).to("cuda")
147
+
148
+ # Generate outputs for the math problem
149
+ math_outputs = model.generate(
150
+     math_input_ids,
151
+     max_new_tokens=512,
152
+     do_sample=True,
153
+     temperature=0.6,  # Slightly lower temperature for more deterministic math outputs
154
+     top_p=0.9
155
+ )
156
+
157
+ print("\n--- Math Example ---")
158
+ print(tokenizer.decode(math_outputs[0], skip_special_tokens=True))
159
+ ```
160
+
161
+ ## Limitations and Bias
162
+ As a large language model, this fine-tuned Qwen3-8B model inherits general limitations and potential biases from its pre-training and fine-tuning data:
163
+
164
+ * **Hallucinations:** The model may generate information that is factually incorrect or nonsensical. Always cross-reference critical information.
165
+ * **Factual Accuracy:** Although specialized in medical and mathematical domains, the model should not be used as a substitute for professional medical advice, complex mathematical proofs, or any task requiring absolute precision without independent verification.
166
+ * **Bias:** The model's outputs are influenced by biases present in its training data (both the base model's pre-training data and the fine-tuning datasets). This may manifest as stereotypical, harmful, or unfair content.
167
+ * **Language Proficiency:** The model was trained primarily on English text. Some Spanish content was present in the general chat dataset, but proficiency in Spanish or other languages is not guaranteed and may vary.
168
+ * **Context Window:** The model is limited by its `max_seq_length` (2048 tokens). Very long inputs or extensive multi-turn conversations may lead to degraded performance or truncated context.
169
+ ## Ethical Considerations
170
+ Users should be aware of the following ethical considerations when deploying or using this model:
171
+
172
+ * **Not for Critical Applications:** This model is intended for research, experimentation, and exploratory applications. It is not designed or validated for use in critical systems where accuracy, reliability, and safety are paramount (e.g., medical diagnosis, financial advice, legal counsel, or decision-making systems impacting individuals).
173
+ * **Responsible AI Use:** Deploy and use this model responsibly, adhering to ethical AI guidelines and principles. Implement safeguards to monitor its outputs and prevent misuse, discrimination, or the generation of harmful content.
174
+ * **Data Privacy and Security:** Do not use this model with sensitive personally identifiable information (PII) or confidential data. Ensure compliance with all relevant data privacy regulations.
175
+ * **Transparency:** Be transparent with end users when they are interacting with an AI system.
176
+ ## Citation
177
+ If you use this model or the training methodology, please consider citing the following key components:
178
+
179
+ ```bibtex
180
+ @misc{qwen3,
181
+   author = {Qwen Team},
182
+   title = {Qwen3-8B},
183
+   year = {2025},
184
+   publisher = {Hugging Face},
185
+   howpublished = {\url{https://huggingface.co/Qwen/Qwen3-8B}}
186
+ }
187
+
188
+ @misc{unsloth,
189
+   author = {Daniel Han},
190
+   title = {Unsloth: Fast LLM Fine-tuning},
191
+   year = {2023},
192
+   publisher = {GitHub},
193
+   howpublished = {\url{https://github.com/unslothai/unsloth}}
194
+ }
195
+
196
+ @misc{trl,
197
+   author = {Hugging Face Team},
198
+   title = {TRL: Transformer Reinforcement Learning},
199
+   year = {2023},
200
+   publisher = {GitHub},
201
+   howpublished = {\url{https://github.com/huggingface/trl}}
202
+ }
203
+
204
+ @misc{medical_dataset,
205
+   author = {FreedomIntelligence},
206
+   title = {medical-o1-reasoning-SFT},
207
+   year = {2024},
208
+   publisher = {Hugging Face},
209
+   howpublished = {\url{https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT}}
210
+ }
211
+
212
+ @misc{openmathreasoning_mini,
213
+   author = {unsloth},
214
+   title = {OpenMathReasoning-mini},
215
+   year = {2025},
216
+   publisher = {Hugging Face},
217
+   howpublished = {\url{https://huggingface.co/datasets/unsloth/OpenMathReasoning-mini}}
218
+ }
219
+
220
+ @misc{guanaco_llama2_1k,
221
+   author = {mlabonne},
222
+   title = {guanaco-llama2-1k},
223
+   year = {2023},
224
+   publisher = {Hugging Face},
225
+   howpublished = {\url{https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k}}
226
+ }
227
+ ```
imatrix.dat ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:071db6d72fa37286b3598ff120179f4848fa87cb347adfc861ccafaa061495ac
3
+ size 5316789
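The three added lines above are a Git LFS pointer: the repository stores only the spec version, a SHA-256 object ID, and the byte size, while the real file lives in LFS storage. As a rough sketch (the helper names `parse_lfs_pointer` and `verify_file` are hypothetical, not part of any Git LFS tooling), a downloaded file could be checked against its pointer like this:

```python
import hashlib

# Pointer text copied from the imatrix.dat hunk above.
IMATRIX_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:071db6d72fa37286b3598ff120179f4848fa87cb347adfc861ccafaa061495ac
size 5316789
"""


def parse_lfs_pointer(text):
    """Split a pointer file into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def verify_file(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path matches the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte GGUF files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(IMATRIX_POINTER)
    print(pointer["algo"], pointer["size"])
```

The same check applies to the GGUF pointers below; only the pointer text and the local path change.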
qwen3-8b-hippocratesv1.fp16.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3b55021a543dd6d4d37eaa76e5a7da5cc0ca945c9138f46ecca5320cdf5d9dd5
3
+ size 16388045056
qwen3-8b-hippocratesv1.q2_k-imat-00001-of-00002.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5efd453b61a2e7ae17064405c3649b26ad9950f0332f6f69cf7c9fb6ee86c1e9
3
+ size 2356901152
qwen3-8b-hippocratesv1.q2_k-imat-00002-of-00002.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:248147f1df17c58809abc859660f75566b2ef81245a092f7039db2fd7af723d3
3
+ size 924833184