Update README.md
README.md CHANGED
@@ -7,6 +7,9 @@ base_model:
 - Qwen/Qwen2.5-0.5B
 ---
 
+I tried training the model with GRPO so that it can solve simple arithmetic problems. The training code is further down.
+
+Since the problems are simple, the training data was synthesized on the fly (see the code).
 
 prompt format:
 ```
@@ -56,9 +59,10 @@ print(pipe(prompt)[0]["generated_text"][len(prompt):])
 
 ## Training information
 
+- Base model: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
 - Device: 1x A100 80G
 - GPU Hour: about 1 hour
--
+- Total training steps: 140 steps ([the last checkpoint](https://huggingface.co/p1atdev/qwen2.5-0.5b-grpo-math-01/blob/9ede090f5ed41d88c16ffbc56a81b0772f19679e/model.safetensors))
 
 Wandb log: https://wandb.ai/p1atdev/grpo-math-01/runs/ytv8wxll
 
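The lines added in the first hunk say the training data was synthesized on the fly, but the code they point to is outside this diff. As a rough illustration only, a generator for simple arithmetic problems might look like the sketch below. The names `make_problem` and `make_dataset`, the operand range, the set of operators, and the question wording are all hypothetical and are not taken from the author's training script.

```python
import random

# Hypothetical sketch: NOT the author's training code, which this diff does not show.
# Operators and operand range are assumptions chosen for illustration.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_problem(rng: random.Random, max_operand: int = 20) -> dict:
    """Return one synthetic arithmetic problem together with its integer answer."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    op = rng.choice(sorted(OPS))  # sorted() keeps the choice order stable
    return {"question": f"{a} {op} {b} = ?", "answer": OPS[op](a, b)}


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n problems deterministically from a seed."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in make_dataset(3):
        print(row["question"], row["answer"])
```

Because each answer is computed alongside its question, a reward function for GRPO-style training could in principle compare a model's parsed output against `row["answer"]`; how the actual run scored completions is not stated in this diff.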