Update README.md
README.md
CHANGED
@@ -24,7 +24,7 @@ For SFT stage we using the hyperparameters:
 - Learning-Rate: 5e-5.
 - Number Of Epoch: 6.

-For RL stage we
+For the Reinforcement Learning (RL) stage, we designed a two-stage training process. The first stage focuses on enhancing the model's reasoning capabilities for complex medical questions. The second stage ensures that the model's responses prioritize safety and helpfulness. Both stages utilize the following configuration:

 - Max prompt length: 2048 tokens.
 - Max response length: 12288 tokens.