tsilva committed on
Commit 74068e9 · verified · 1 Parent(s): d341237

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ artifacts/videos/eval/episodes/best_checkpoint.mp4 filter=lfs diff=lfs merge=lfs -text
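
Note: the hunk above routes the newly uploaded evaluation video through Git LFS, alongside the existing archive and TensorBoard-event patterns. As a rough illustration of what this routing means, here is a minimal Python sketch that checks a path against these patterns; real gitattributes matching is gitignore-style, so fnmatch is only an approximation, and `is_lfs_tracked` is a hypothetical helper, not part of this repo.

    from fnmatch import fnmatch

    # Patterns taken from the .gitattributes hunk above. Real gitattributes
    # matching is gitignore-style; fnmatch only approximates it.
    LFS_PATTERNS = [
        "*.zip",
        "*.zst",
        "*tfevents*",
        "artifacts/videos/eval/episodes/best_checkpoint.mp4",
    ]

    def is_lfs_tracked(path: str) -> bool:
        """Return True if the path would be stored as an LFS pointer."""
        name = path.rsplit("/", 1)[-1]
        return any(fnmatch(name, pat) or fnmatch(path, pat) for pat in LFS_PATTERNS)

    print(is_lfs_tracked("artifacts/videos/eval/episodes/best_checkpoint.mp4"))  # True

Files matched this way are committed as small pointer stubs (see the pointer file at the end of this commit) while the binary payload lives in LFS storage.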
artifacts/hyperparam_control/hyperparameters.json CHANGED
@@ -12,5 +12,5 @@
  "clip_range - PPO clipping range (PPO only)",
  "vf_coef - Value function coefficient (PPO only)"
  ],
- "last_modified": 1755378648.2642398
+ "last_modified": 1755381605.2407584
  }
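
Note: the bumped `last_modified` value is a Unix timestamp (the two commits here differ by 1755381605 - 1755378648 ≈ 2957 s, about 49 minutes), tied to the manual hyperparameter-control feature announced in the training log below, where the trainer exposes runs/cvb5lyfw/hyperparam_control/hyperparameters.json for live edits. A minimal sketch of how such a control file could be polled between rollouts, assuming a simple mtime check; `poll_hyperparams` is a hypothetical helper and the actual gymnasium-solver mechanism may differ.

    import json
    import os

    CONTROL_FILE = "runs/cvb5lyfw/hyperparam_control/hyperparameters.json"

    def poll_hyperparams(last_mtime: float):
        """Reload the control file if it changed since last_mtime.

        Returns (params, mtime) when the file was rewritten, else (None, last_mtime).
        """
        mtime = os.path.getmtime(CONTROL_FILE)
        if mtime <= last_mtime:
            return None, last_mtime
        with open(CONTROL_FILE) as f:
            params = json.load(f)
        return params, mtime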
artifacts/logs/training_20250816_230005_ppo_CartPole-v1.log ADDED
@@ -0,0 +1,897 @@
+ === Training Session Started ===
+ Timestamp: 2025-08-16 23:00:05
+ Log file: runs/cvb5lyfw/logs/training_20250816_230005_ppo_CartPole-v1.log
+ Algorithm: ppo
+ Environment: CartPole-v1
+ Seed: 42
+ ==================================================
+
+ Configuration saved to: runs/cvb5lyfw/configs/config.json
+ /home/tsilva/repos/tsilva/gymnasium-solver/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
+
+ 🎛️ Hyperparameter manual control enabled!
+ Control directory: runs/cvb5lyfw/hyperparam_control
+ Control file: hyperparameters.json
+ Edit this file to adjust hyperparameters during training.
+ /home/tsilva/repos/tsilva/gymnasium-solver/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
+ /home/tsilva/repos/tsilva/gymnasium-solver/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:310: The number of training batches (20) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
+ --------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 23.52 |
+ | ep_len_mean | 23.00 |
+ | epoch | 8 |
+ | total_timesteps | 2560 |
+ | total_episodes | 107 |
+ | total_rollouts | 10.00 |
+ | rollout_timesteps | 256 |
+ | rollout_episodes | 11.00 |
+ | fps_instant | 4492 |
+ | rollout_fps | 21420.19 |
+ | loss | 9.12 |
+ | policy_loss | -0.0078 |
+ | value_loss | 18.2623 |
+ | entropy_loss | -0.6738 |
+ | action_mean | 0.51 |
+ | action_std | 0.50 |
+ | approx_kl | 0.0059 |
+ | baseline_mean | 0.00 |
+ | baseline_std | 0.00 |
+ | clip_fraction | 0.093 |
+ | clip_range | 0.1954 |
+ | entropy | 0.6738 |
+ | explained_variance | 0.258 |
+ | fps | 1080 |
+ | kl_div | 0.0031 |
+ | learning_rate | 0.000977 |
+ | obs_mean | -0.02 |
+ | obs_std | 0.45 |
+ | reward_mean | 1.00 |
+ | reward_std | 0.00 |
+ | time_elapsed | 2.37 |
+ --------------------------------------------
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 30.24 ↑6.72 |
+ | ep_len_mean | 30.00 ↑7.00 |
+ | epoch | 18 ↑10 |
+ | total_timesteps | 5120 ↑2560.00 |
+ | total_episodes | 178 ↑71 |
+ | total_rollouts | 20.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 5.00 ↓6.00 |
+ | fps_instant | 4248 ↓244 |
+ | rollout_fps | 21941.32 ↑521.13 |
+ | loss | 7.85 ↓1.27 |
+ | policy_loss | 0.0070 ↑0.0148 |
+ | value_loss | 15.6912 ↓2.5711 |
+ | entropy_loss | -0.6368 ↑0.0370 |
+ | action_mean | 0.50 ↓0.01 |
+ | action_std | 0.50 ↑0.0001 |
+ | approx_kl | 0.0109 ↑0.0050 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.078 ↓0.015 |
+ | clip_range | 0.1903 ↓0.0051 |
+ | entropy | 0.6368 ↓0.0370 |
+ | explained_variance | 0.459 ↑0.201 |
+ | fps | 1723 ↑643 |
+ | kl_div | 0.0298 ↑0.0266 |
+ | learning_rate | 0.000951 ↓0.000026 |
+ | obs_mean | -0.00 ↑0.02 |
+ | obs_std | 0.43 ↓0.01 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 2.97 ↑0.60 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: Very negative explained variance (-0.423) indicates value function is performing poorly. Check value function architecture or learning rate.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 45.78 ↑15.54 |
+ | ep_len_mean | 45.00 ↑15.00 |
+ | epoch | 28 ↑10 |
+ | total_timesteps | 7680 ↑2560.00 |
+ | total_episodes | 217 ↑39 |
+ | total_rollouts | 30.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 1.00 ↓4.00 |
+ | fps_instant | 5165 ↑917 |
+ | rollout_fps | 22601.49 ↑660.17 |
+ | loss | 6.52 ↓1.33 |
+ | policy_loss | 0.0055 ↓0.0015 |
+ | value_loss | 13.0365 ↓2.6547 |
+ | entropy_loss | -0.6026 ↑0.0342 |
+ | action_mean | 0.50 ↑0.0040 |
+ | action_std | 0.50 ↑2.62e-06 |
+ | approx_kl | 0.0046 ↓0.0063 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.055 ↓0.023 |
+ | clip_range | 0.1852 ↓0.0051 |
+ | entropy | 0.6026 ↓0.0342 |
+ | explained_variance | -0.423 ↓0.882 |
+ | fps | 2161 ↑438 |
+ | kl_div | 0.0101 ↓0.0197 |
+ | learning_rate | 0.000926 ↓0.000026 |
+ | obs_mean | 0.02 ↑0.02 |
+ | obs_std | 0.46 ↑0.03 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 3.55 ↑0.58 |
+ -----------------------------------------------
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 63.52 ↑17.74 |
+ | ep_len_mean | 63.00 ↑18.00 |
+ | epoch | 38 ↑10 |
+ | total_timesteps | 10240 ↑2560.00 |
+ | total_episodes | 241 ↑24 |
+ | total_rollouts | 40.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 2.00 ↑1.00 |
+ | fps_instant | 4253 ↓913 |
+ | rollout_fps | 22551.70 ↓49.80 |
+ | loss | 1.34 ↓5.19 |
+ | policy_loss | 0.0110 ↑0.0055 |
+ | value_loss | 2.6502 ↓10.3863 |
+ | entropy_loss | -0.6145 ↓0.0118 |
+ | action_mean | 0.50 ↑0.0025 |
+ | action_std | 0.50 ↓1.48e-05 |
+ | approx_kl | 0.0034 ↓0.0012 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.042 ↓0.012 |
+ | clip_range | 0.1800 ↓0.0051 |
+ | entropy | 0.6145 ↑0.0118 |
+ | explained_variance | 0.962 ↑1.385 |
+ | fps | 2449 ↑288 |
+ | kl_div | 0.0083 ↓0.0018 |
+ | learning_rate | 0.000900 ↓0.000026 |
+ | obs_mean | 0.06 ↑0.04 |
+ | obs_std | 0.51 ↑0.04 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 4.18 ↑0.63 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.144) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 80.62 ↑17.10 |
+ | ep_len_mean | 80.00 ↑17.00 |
+ | epoch | 48 ↑10 |
+ | total_timesteps | 12800 ↑2560.00 |
+ | total_episodes | 265 ↑24 |
+ | total_rollouts | 50.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 2.00 →0 |
+ | fps_instant | 4129 ↓123 |
+ | rollout_fps | 22596.84 ↑45.14 |
+ | loss | 1.53 ↑0.19 |
+ | policy_loss | 0.0126 ↑0.0015 |
+ | value_loss | 3.0301 ↑0.3799 |
+ | entropy_loss | -0.5850 ↑0.0294 |
+ | action_mean | 0.51 ↑0.0027 |
+ | action_std | 0.50 ↓2.96e-05 |
+ | approx_kl | 0.0081 ↑0.0047 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.144 ↑0.102 |
+ | clip_range | 0.1749 ↓0.0051 |
+ | entropy | 0.5850 ↓0.0294 |
+ | explained_variance | 0.963 ↑0.001 |
+ | fps | 2669 ↑220 |
+ | kl_div | 0.0067 ↓0.0016 |
+ | learning_rate | 0.000875 ↓0.000026 |
+ | obs_mean | 0.09 ↑0.03 |
+ | obs_std | 0.54 ↑0.04 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 4.80 ↑0.61 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.186) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 95.26 ↑14.64 |
+ | ep_len_mean | 95.00 ↑15.00 |
+ | epoch | 58 ↑10 |
+ | total_timesteps | 15360 ↑2560.00 |
+ | total_episodes | 293 ↑28 |
+ | total_rollouts | 60.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 6.00 ↑4.00 |
+ | fps_instant | 4169 ↑40 |
+ | rollout_fps | 22267.97 ↓328.87 |
+ | loss | 18.56 ↑17.03 |
+ | policy_loss | 0.0049 ↓0.0077 |
+ | value_loss | 37.1057 ↑34.0756 |
+ | entropy_loss | -0.5526 ↑0.0325 |
+ | action_mean | 0.51 ↑0.0018 |
+ | action_std | 0.50 ↓2.88e-05 |
+ | approx_kl | 0.0188 ↑0.0106 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.186 ↑0.042 |
+ | clip_range | 0.1698 ↓0.0051 |
+ | entropy | 0.5526 ↓0.0325 |
+ | explained_variance | 0.649 ↓0.313 |
+ | fps | 2828 ↑159 |
+ | kl_div | 0.0115 ↑0.0048 |
+ | learning_rate | 0.000849 ↓0.000026 |
+ | obs_mean | 0.12 ↑0.03 |
+ | obs_std | 0.56 ↑0.01 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 5.43 ↑0.64 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.340) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 104.72 ↑9.46 |
+ | ep_len_mean | 104.00 ↑9.00 |
+ | epoch | 68 ↑10 |
+ | total_timesteps | 17920 ↑2560.00 |
+ | total_episodes | 314 ↑21 |
+ | total_rollouts | 70.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 2.00 ↓4.00 |
+ | fps_instant | 3896 ↓273 |
+ | rollout_fps | 22153.08 ↓114.89 |
+ | loss | 4.85 ↓13.70 |
+ | policy_loss | 0.0009 ↓0.0040 |
+ | value_loss | 9.7081 ↓27.3976 |
+ | entropy_loss | -0.5655 ↓0.0129 |
+ | action_mean | 0.51 ↑0.0024 |
+ | action_std | 0.50 ↓4.85e-05 |
+ | approx_kl | 0.0190 ↑0.0002 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.340 ↑0.154 |
+ | clip_range | 0.1647 ↓0.0051 |
+ | entropy | 0.5655 ↑0.0129 |
+ | explained_variance | 0.775 ↑0.126 |
+ | fps | 2954 ↑125 |
+ | kl_div | 0.0135 ↑0.0020 |
+ | learning_rate | 0.000823 ↓0.000026 |
+ | obs_mean | 0.13 ↑0.01 |
+ | obs_std | 0.57 ↑0.02 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 6.07 ↑0.64 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High approximate KL divergence (0.1479) indicates large policy changes. Consider reducing learning rate.
+ ⚠️ ALGORITHM WARNING: High KL divergence (0.1581) indicates large policy changes. Consider reducing learning rate.
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.492) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 97.52 ↓7.20 |
+ | ep_len_mean | 97.00 ↓7.00 |
+ | epoch | 78 ↑10 |
+ | total_timesteps | 20480 ↑2560.00 |
+ | total_episodes | 347 ↑33 |
+ | total_rollouts | 80.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 7.00 ↑5.00 |
+ | fps_instant | 6218 ↑2322.00 |
+ | rollout_fps | 22400.17 ↑247.09 |
+ | loss | 30.81 ↑25.95 |
+ | policy_loss | 0.0083 ↑0.0074 |
+ | value_loss | 61.5977 ↑51.8896 |
+ | entropy_loss | -0.3653 ↑0.2001 |
+ | action_mean | 0.51 ↑0.0016 |
+ | action_std | 0.50 ↓3.78e-05 |
+ | approx_kl | 0.1479 ↑0.1289 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.492 ↑0.152 |
+ | clip_range | 0.1596 ↓0.0051 |
+ | entropy | 0.3653 ↓0.2001 |
+ | explained_variance | 0.198 ↓0.578 |
+ | fps | 3065 ↑111 |
+ | kl_div | 0.1581 ↑0.1446 |
+ | learning_rate | 0.000798 ↓0.000026 |
+ | obs_mean | 0.14 ↑0.01 |
+ | obs_std | 0.59 ↑0.02 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 6.68 ↑0.62 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.182) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 96.61 ↓0.91 |
+ | ep_len_mean | 96.00 ↓1.00 |
+ | epoch | 88 ↑10 |
+ | total_timesteps | 23040 ↑2560.00 |
+ | total_episodes | 365 ↑18 |
+ | total_rollouts | 90.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 1.00 ↓6.00 |
+ | fps_instant | 4889 ↓1329.00 |
+ | rollout_fps | 22974.76 ↑574.60 |
+ | loss | 3.98 ↓26.83 |
+ | policy_loss | 0.0053 ↓0.0030 |
+ | value_loss | 7.9501 ↓53.6476 |
+ | entropy_loss | -0.5333 ↓0.1680 |
+ | action_mean | 0.51 ↑0.0010 |
+ | action_std | 0.50 ↓2.69e-05 |
+ | approx_kl | 0.0189 ↓0.1289 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.182 ↓0.310 |
+ | clip_range | 0.1544 ↓0.0051 |
+ | entropy | 0.5333 ↑0.1680 |
+ | explained_variance | 0.924 ↑0.726 |
+ | fps | 3186 ↑121 |
+ | kl_div | 0.0245 ↓0.1336 |
+ | learning_rate | 0.000772 ↓0.000026 |
+ | obs_mean | 0.13 ↓0.0021 |
+ | obs_std | 0.60 ↑0.01 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 7.23 ↑0.55 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.183) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 111.95 ↑15.34 |
+ | ep_len_mean | 111.00 ↑15.00 |
+ | epoch | 98 ↑10 |
+ | total_timesteps | 25600 ↑2560.00 |
+ | total_episodes | 382 ↑17 |
+ | total_rollouts | 100.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 2.00 ↑1.00 |
+ | fps_instant | 4399 ↓490 |
+ | rollout_fps | 23067.73 ↑92.96 |
+ | loss | 4.33 ↑0.35 |
+ | policy_loss | -0.0061 ↓0.0114 |
+ | value_loss | 8.6635 ↑0.7134 |
+ | entropy_loss | -0.5509 ↓0.0177 |
+ | action_mean | 0.51 ↑0.0007 |
+ | action_std | 0.50 ↓1.96e-05 |
+ | approx_kl | 0.0123 ↓0.0066 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.183 ↑0.001 |
+ | clip_range | 0.1493 ↓0.0051 |
+ | entropy | 0.5509 ↑0.0177 |
+ | explained_variance | 0.936 ↑0.012 |
+ | fps | 3268 ↑83 |
+ | kl_div | 0.0209 ↓0.0036 |
+ | learning_rate | 0.000747 ↓0.000026 |
+ | obs_mean | 0.12 ↓0.01 |
+ | obs_std | 0.62 ↑0.02 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 7.83 ↑0.60 |
+ -----------------------------------------------
+ /home/tsilva/repos/tsilva/gymnasium-solver/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.282) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 111.95 →0 |
+ | ep_len_mean | 111.00 →0 |
+ | epoch | 98 →0 |
+ | total_timesteps | 25600 →0 |
+ | total_episodes | 382 →0 |
+ | total_rollouts | 100.00 →0 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 2.00 →0 |
+ | fps_instant | 4399 →0 |
+ | rollout_fps | 23067.73 →0 |
+ | loss | 0.28 ↓4.05 |
+ | policy_loss | 0.0020 ↑0.0081 |
+ | value_loss | 0.5542 ↓8.1094 |
+ | entropy_loss | -0.5920 ↓0.0410 |
+ | action_mean | 0.51 →0 |
+ | action_std | 0.50 →0 |
+ | approx_kl | 0.0096 ↓0.0027 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.282 ↑0.099 |
+ | clip_range | 0.1488 ↓0.0005 |
+ | entropy | 0.5920 ↑0.0410 |
+ | explained_variance | 0.990 ↑0.054 |
+ | fps | 3268 →0 |
+ | kl_div | 0.0158 ↓0.0050 |
+ | learning_rate | 0.000744 ↓0.000003 |
+ | obs_mean | 0.12 →0 |
+ | obs_std | 0.62 →0 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 7.83 →0 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 |
+ | ep_len_mean | 272.40 |
+ | epoch | 99 |
+ | total_timesteps | 6040 |
+ | total_episodes | 10 |
+ | epoch_fps | 3127.00 |
+ -----------------------------------------------
+ New best model saved with eval/ep_rew_mean=272.4000
+ Timestamped: runs/cvb5lyfw/checkpoints/epoch=99-step=2000.ckpt
+ Best: runs/cvb5lyfw/checkpoints/best_checkpoint.ckpt
+ Using environment spec reward_threshold: 475.0
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 113.88 ↑1.93 |
+ | ep_len_mean | 113.00 ↑2.00 |
+ | epoch | 108 ↑10 |
+ | total_timesteps | 28160 ↑2560.00 |
+ | total_episodes | 403 ↑21 |
+ | total_rollouts | 110.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 1.00 ↓1.00 |
+ | fps_instant | 4467 ↑69 |
+ | rollout_fps | 23455.68 ↑387.96 |
+ | loss | 0.71 ↑0.43 |
+ | policy_loss | 0.0094 ↑0.0074 |
+ | value_loss | 1.3914 ↑0.8372 |
+ | entropy_loss | -0.5770 ↑0.0149 |
+ | action_mean | 0.52 ↑0.0020 |
+ | action_std | 0.50 ↓0.0001 |
+ | approx_kl | 0.0038 ↓0.0057 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.088 ↓0.195 |
+ | clip_range | 0.1442 ↓0.0046 |
+ | entropy | 0.5770 ↓0.0149 |
+ | explained_variance | 0.985 ↓0.005 |
+ | fps | 2721 ↓548 |
+ | kl_div | 0.0016 ↓0.0142 |
+ | learning_rate | 0.000721 ↓0.000023 |
+ | obs_mean | 0.14 ↑0.02 |
+ | obs_std | 0.62 ↑0.0041 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 10.35 ↑2.52 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.343) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 121.08 ↑7.20 |
+ | ep_len_mean | 121.00 ↑8.00 |
+ | epoch | 118 ↑10 |
+ | total_timesteps | 30720 ↑2560.00 |
+ | total_episodes | 415 ↑12 |
+ | total_rollouts | 120.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 ↓1.00 |
+ | fps_instant | 5078 ↑611 |
+ | rollout_fps | 23675.15 ↑219.47 |
+ | loss | 0.17 ↓0.53 |
+ | policy_loss | -0.0080 ↓0.0173 |
+ | value_loss | 0.3605 ↓1.0309 |
+ | entropy_loss | -0.5659 ↑0.0111 |
+ | action_mean | 0.52 ↑0.0002 |
+ | action_std | 0.50 ↓6.38e-06 |
+ | approx_kl | 0.0105 ↑0.0067 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.343 ↑0.255 |
+ | clip_range | 0.1391 ↓0.0051 |
+ | entropy | 0.5659 ↓0.0111 |
+ | explained_variance | 0.995 ↑0.011 |
+ | fps | 2805 ↑84 |
+ | kl_div | 0.0180 ↑0.0164 |
+ | learning_rate | 0.000695 ↓0.000026 |
+ | obs_mean | 0.14 ↑0.0037 |
+ | obs_std | 0.62 ↓0.0022 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 10.95 ↑0.60 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.119) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 131.68 ↑10.60 |
+ | ep_len_mean | 131.00 ↑10.00 |
+ | epoch | 128 ↑10 |
+ | total_timesteps | 33280 ↑2560.00 |
+ | total_episodes | 430 ↑15 |
+ | total_rollouts | 130.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 3.00 ↑3.00 |
+ | fps_instant | 4648 ↓431 |
+ | rollout_fps | 23648.98 ↓26.17 |
+ | loss | 24.81 ↑24.64 |
+ | policy_loss | 0.0002 ↑0.0082 |
+ | value_loss | 49.6250 ↑49.2645 |
+ | entropy_loss | -0.5436 ↑0.0223 |
+ | action_mean | 0.52 ↓0.0006 |
+ | action_std | 0.50 ↑1.93e-05 |
+ | approx_kl | 0.0064 ↓0.0041 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.119 ↓0.224 |
+ | clip_range | 0.1340 ↓0.0051 |
+ | entropy | 0.5436 ↓0.0223 |
+ | explained_variance | 0.438 ↓0.557 |
+ | fps | 2889 ↑85 |
+ | kl_div | 0.0107 ↓0.0073 |
+ | learning_rate | 0.000670 ↓0.000026 |
+ | obs_mean | 0.14 ↓0.01 |
+ | obs_std | 0.62 ↓0.0026 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 11.52 ↑0.56 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.524) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 139.29 ↑7.61 |
+ | ep_len_mean | 139.00 ↑8.00 |
+ | epoch | 138 ↑10 |
+ | total_timesteps | 35840 ↑2560.00 |
+ | total_episodes | 435 ↑5 |
+ | total_rollouts | 140.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 ↓3.00 |
+ | fps_instant | 4137 ↓510 |
+ | rollout_fps | 23805.51 ↑156.53 |
+ | loss | -0.01 ↓24.82 |
+ | policy_loss | -0.0318 ↓0.0320 |
+ | value_loss | 0.0524 ↓49.5726 |
+ | entropy_loss | -0.5789 ↓0.0353 |
+ | action_mean | 0.52 ↓0.0007 |
+ | action_std | 0.50 ↑2.20e-05 |
+ | approx_kl | 0.0180 ↑0.0116 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.524 ↑0.405 |
+ | clip_range | 0.1288 ↓0.0051 |
+ | entropy | 0.5789 ↑0.0353 |
+ | explained_variance | 0.138 ↓0.300 |
+ | fps | 2957 ↑67 |
+ | kl_div | 0.0276 ↑0.0169 |
+ | learning_rate | 0.000644 ↓0.000026 |
+ | obs_mean | 0.12 ↓0.01 |
+ | obs_std | 0.62 ↓0.0024 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 12.12 ↑0.60 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 162.72 ↑23.43 |
+ | ep_len_mean | 162.00 ↑23.00 |
+ | epoch | 148 ↑10 |
+ | total_timesteps | 38400 ↑2560.00 |
+ | total_episodes | 440 ↑5 |
+ | total_rollouts | 150.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 →0 |
+ | fps_instant | 4708 ↑570 |
+ | rollout_fps | 24173.95 ↑368.44 |
+ | loss | 0.01 ↑0.01 |
+ | policy_loss | -0.0012 ↑0.0306 |
+ | value_loss | 0.0134 ↓0.0390 |
+ | entropy_loss | -0.5216 ↑0.0573 |
+ | action_mean | 0.51 ↓0.0010 |
+ | action_std | 0.50 ↑2.89e-05 |
+ | approx_kl | 0.0011 ↓0.0170 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.029 ↓0.494 |
+ | clip_range | 0.1237 ↓0.0051 |
+ | entropy | 0.5216 ↓0.0573 |
+ | explained_variance | 0.937 ↑0.800 |
+ | fps | 3026 ↑70 |
+ | kl_div | 0.0038 ↓0.0238 |
+ | learning_rate | 0.000619 ↓0.000026 |
+ | obs_mean | 0.09 ↓0.03 |
+ | obs_std | 0.63 ↑0.02 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 12.69 ↑0.57 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 186.74 ↑24.02 |
+ | ep_len_mean | 186.00 ↑24.00 |
+ | epoch | 158 ↑10 |
+ | total_timesteps | 40960 ↑2560.00 |
+ | total_episodes | 445 ↑5 |
+ | total_rollouts | 160.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 →0 |
+ | fps_instant | 5483 ↑775 |
+ | rollout_fps | 24476.36 ↑302.41 |
+ | loss | -0.00 ↓0.01 |
+ | policy_loss | -0.0021 ↓0.0009 |
+ | value_loss | 0.0021 ↓0.0113 |
+ | entropy_loss | -0.5528 ↓0.0311 |
+ | action_mean | 0.51 ↓0.0007 |
+ | action_std | 0.50 ↑1.97e-05 |
+ | approx_kl | 0.0013 ↑0.0002 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.041 ↑0.012 |
+ | clip_range | 0.1186 ↓0.0051 |
+ | entropy | 0.5528 ↑0.0311 |
+ | explained_variance | 0.832 ↓0.106 |
+ | fps | 3082 ↑55 |
+ | kl_div | -0.0002 ↓0.0040 |
+ | learning_rate | 0.000593 ↓0.000026 |
+ | obs_mean | 0.07 ↓0.03 |
+ | obs_std | 0.64 ↑0.01 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 13.29 ↑0.60 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.130) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 209.88 ↑23.14 |
+ | ep_len_mean | 209.00 ↑23.00 |
+ | epoch | 168 ↑10 |
+ | total_timesteps | 43520 ↑2560.00 |
+ | total_episodes | 451 ↑6 |
+ | total_rollouts | 170.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 1.00 ↑1.00 |
+ | fps_instant | 4587 ↓896 |
+ | rollout_fps | 24439.38 ↓36.98 |
+ | loss | -0.01 ↓0.01 |
+ | policy_loss | -0.0078 ↓0.0056 |
+ | value_loss | 0.0027 ↑0.0007 |
+ | entropy_loss | -0.5290 ↑0.0237 |
+ | action_mean | 0.51 ↓0.0009 |
+ | action_std | 0.50 ↑2.43e-05 |
+ | approx_kl | 0.0028 ↑0.0015 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.130 ↑0.089 |
+ | clip_range | 0.1135 ↓0.0051 |
+ | entropy | 0.5290 ↓0.0237 |
+ | explained_variance | 0.399 ↓0.433 |
+ | fps | 3112 ↑31 |
+ | kl_div | 0.0016 ↑0.0018 |
+ | learning_rate | 0.000567 ↓0.000026 |
+ | obs_mean | 0.05 ↓0.02 |
+ | obs_std | 0.64 ↓0.0049 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 13.98 ↑0.69 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.164) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ ⚠️ ALGORITHM WARNING: Very negative explained variance (-0.106) indicates value function is performing poorly. Check value function architecture or learning rate.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 230.88 ↑21.00 |
+ | ep_len_mean | 230.00 ↑21.00 |
+ | epoch | 178 ↑10 |
+ | total_timesteps | 46080 ↑2560.00 |
+ | total_episodes | 456 ↑5 |
+ | total_rollouts | 180.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 ↓1.00 |
+ | fps_instant | 4984 ↑397 |
+ | rollout_fps | 24308.83 ↓130.55 |
+ | loss | -0.01 ↓0.0013 |
+ | policy_loss | -0.0101 ↓0.0023 |
+ | value_loss | 0.0048 ↑0.0020 |
+ | entropy_loss | -0.4638 ↑0.0653 |
+ | action_mean | 0.51 ↓0.0007 |
+ | action_std | 0.50 ↑1.76e-05 |
+ | approx_kl | 0.0037 ↑0.0009 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.164 ↑0.034 |
+ | clip_range | 0.1084 ↓0.0051 |
+ | entropy | 0.4638 ↓0.0653 |
+ | explained_variance | -0.106 ↓0.505 |
+ | fps | 3154 ↑42 |
+ | kl_div | 0.0033 ↑0.0017 |
+ | learning_rate | 0.000542 ↓0.000026 |
+ | obs_mean | 0.04 ↓0.01 |
+ | obs_std | 0.62 ↓0.01 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 14.61 ↑0.63 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip fraction (0.142) indicates policy is changing too rapidly. Consider reducing learning rate or clip_range.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 252.27 ↑21.39 |
+ | ep_len_mean | 252.00 ↑22.00 |
+ | epoch | 188 ↑10 |
+ | total_timesteps | 48640 ↑2560.00 |
+ | total_episodes | 461 ↑5 |
+ | total_rollouts | 190.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 →0 |
+ | fps_instant | 4553 ↓432 |
+ | rollout_fps | 23721.97 ↓586.86 |
+ | loss | -0.00 ↑0.0042 |
+ | policy_loss | -0.0045 ↑0.0056 |
+ | value_loss | 0.0021 ↓0.0027 |
+ | entropy_loss | -0.3854 ↑0.0783 |
+ | action_mean | 0.51 ↓0.0007 |
+ | action_std | 0.50 ↑1.58e-05 |
+ | approx_kl | 0.0022 ↓0.0015 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.142 ↓0.022 |
+ | clip_range | 0.1032 ↓0.0051 |
+ | entropy | 0.3854 ↓0.0783 |
+ | explained_variance | 0.883 ↑0.989 |
+ | fps | 3194 ↑40 |
+ | kl_div | -0.0016 ↓0.0049 |
+ | learning_rate | 0.000516 ↓0.000026 |
+ | obs_mean | 0.03 ↓0.01 |
+ | obs_std | 0.61 ↓0.01 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 15.23 ↑0.62 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip range (0.0981) may lead to unstable training. Consider reducing.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 271.25 ↑18.98 |
+ | ep_len_mean | 271.00 ↑19.00 |
+ | epoch | 198 ↑10 |
+ | total_timesteps | 51200 ↑2560.00 |
+ | total_episodes | 466 ↑5 |
+ | total_rollouts | 200.00 ↑10.00 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 →0 |
+ | fps_instant | 4227 ↓326 |
+ | rollout_fps | 23444.23 ↓277.74 |
+ | loss | 0.00 ↑0.0045 |
+ | policy_loss | -0.0024 ↑0.0022 |
+ | value_loss | 0.0068 ↑0.0047 |
+ | entropy_loss | -0.4011 ↓0.0157 |
+ | action_mean | 0.51 ↓0.0005 |
+ | action_std | 0.50 ↑1.22e-05 |
+ | approx_kl | 0.0010 ↓0.0012 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.059 ↓0.084 |
+ | clip_range | 0.0981 ↓0.0051 |
+ | entropy | 0.4011 ↑0.0157 |
+ | explained_variance | 0.706 ↓0.177 |
+ | fps | 3224 ↑30 |
+ | kl_div | -0.0002 ↑0.0014 |
+ | learning_rate | 0.000491 ↓0.000026 |
+ | obs_mean | 0.02 ↓0.01 |
+ | obs_std | 0.60 ↓0.01 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 15.88 ↑0.65 |
+ | eval/ | |
+ | ep_rew_mean | 272.40 →0 |
+ | ep_len_mean | 272.40 →0 |
+ | epoch | 99 →0 |
+ | total_timesteps | 6040 →0 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 3127.00 →0 |
+ -----------------------------------------------
+ ⚠️ ALGORITHM WARNING: High clip range (0.0976) may lead to unstable training. Consider reducing.
+ -----------------------------------------------
+ | train/ | |
+ | ep_rew_mean | 271.25 →0 |
+ | ep_len_mean | 271.00 →0 |
+ | epoch | 198 →0 |
+ | total_timesteps | 51200 →0 |
+ | total_episodes | 466 →0 |
+ | total_rollouts | 200.00 →0 |
+ | rollout_timesteps | 256 →0 |
+ | rollout_episodes | 0.00 →0 |
+ | fps_instant | 4227 →0 |
+ | rollout_fps | 23444.23 →0 |
+ | loss | -0.00 ↓0.0030 |
+ | policy_loss | -0.0030 ↓0.0006 |
+ | value_loss | 0.0021 ↓0.0047 |
+ | entropy_loss | -0.4021 ↓0.0009 |
+ | action_mean | 0.51 →0 |
+ | action_std | 0.50 →0 |
+ | approx_kl | 0.0011 ↑0.0001 |
+ | baseline_mean | 0.00 →0 |
+ | baseline_std | 0.00 →0 |
+ | clip_fraction | 0.080 ↑0.021 |
+ | clip_range | 0.0976 ↓0.0005 |
+ | entropy | 0.4021 ↑0.0009 |
+ | explained_variance | 0.865 ↑0.160 |
+ | fps | 3224 →0 |
+ | kl_div | -0.0027 ↓0.0025 |
+ | learning_rate | 0.000488 ↓0.000003 |
+ | obs_mean | 0.02 →0 |
+ | obs_std | 0.60 →0 |
+ | reward_mean | 1.00 →0 |
+ | reward_std | 0.00 →0 |
+ | time_elapsed | 15.88 →0 |
+ | eval/ | |
+ | ep_rew_mean | 500.00 ↑227.60 |
+ | ep_len_mean | 500.00 ↑227.60 |
+ | epoch | 199 ↑100 |
+ | total_timesteps | 8000 ↑1960.00 |
+ | total_episodes | 10 →0 |
+ | epoch_fps | 4822.00 ↑1695.00 |
+ -----------------------------------------------
+ New best model saved with eval/ep_rew_mean=500.0000
+ Timestamped: runs/cvb5lyfw/checkpoints/epoch=199-step=4000.ckpt
+ Best: runs/cvb5lyfw/checkpoints/best_checkpoint.ckpt
+ Threshold reached! Saved model with eval/ep_rew_mean=500.0000 (threshold=475.0) at runs/cvb5lyfw/checkpoints/threshold-epoch=199-step=4000.ckpt
+ Early stopping at epoch 199 with eval mean reward 500.00 >= threshold 475.0
+ Using environment spec reward_threshold: 475.0
+ Best model saved at runs/cvb5lyfw/checkpoints/best_checkpoint.ckpt with eval reward 500.00
+ Loading checkpoint from runs/cvb5lyfw/checkpoints/best_checkpoint.ckpt
+ Checkpoint loaded:
+ Epoch: 199
+ Total timesteps: 0
+ Best eval reward: 272.3999938964844
+ Current eval reward: 500.0
+ Is best: True
+ Is threshold: False
+ Saved final evaluation video to: runs/cvb5lyfw/videos/eval/episodes/best_checkpoint.mp4
+
+ 📊 Final hyperparameters:
+ Learning rate: 1.00e-03
+ Entropy coef: 0.000
+ Max grad norm: 0.500
+ Clip range: 0.200
+ Value function coef: 0.500
+ Training completed in 24.45 seconds (0.41 minutes)
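
Note: the recurring ALGORITHM WARNING lines in this log track standard PPO health metrics. As a reference for how such numbers are conventionally computed (a sketch of the textbook definitions, not necessarily this repo's exact code):

    import torch

    def ppo_diagnostics(new_logp, old_logp, values, returns, clip_range):
        """Conventional PPO health metrics of the kind flagged in the log above."""
        ratio = torch.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
        # clip_fraction: share of samples whose ratio left the clipping interval
        clip_fraction = ((ratio - 1.0).abs() > clip_range).float().mean().item()
        # approx_kl: cheap first-order estimate of the policy update's KL divergence
        approx_kl = (old_logp - new_logp).mean().item()
        # explained_variance = 1 - Var(returns - values) / Var(returns);
        # near 1 means the value head predicts returns well, <= 0 means poorly
        explained_var = (1.0 - (returns - values).var() / returns.var()).item()
        return clip_fraction, approx_kl, explained_var

Read against the tables: the clip_fraction of 0.492 at epoch 78 means nearly half the minibatch sat outside the clipping interval (which itself shrinks by about 0.0051 per report as clip_range is annealed), and the negative explained_variance readings line up with the value_loss spikes the other warnings point at.
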
artifacts/videos/eval/episodes/best_checkpoint.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a12461a39db591a152b968f8a0f976630bb100f6aae6540412a9ab19322a9621
+ size 152489
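
Note: the three added lines are a Git LFS pointer stub. The video payload (152,489 bytes) lives in LFS storage; the repo records only the spec version, the SHA-256 object id, and the size. A minimal sketch of parsing such a pointer, assuming the simple key-value format shown above:

    def parse_lfs_pointer(text: str) -> dict:
        """Parse a git-lfs pointer file into its version/oid/size fields."""
        fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
        return {
            "version": fields["version"],
            "oid": fields["oid"].split(":", 1)[1],  # drop the "sha256:" prefix
            "size": int(fields["size"]),            # payload size in bytes
        }

    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:a12461a39db591a152b968f8a0f976630bb100f6aae6540412a9ab19322a9621\n"
        "size 152489\n"
    )
    print(parse_lfs_pointer(pointer)["size"])  # 152489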