a3-rl-DCAgent_exp_rpt_unitsyn-python-large (step 10, 8B)

RL (GRPO / rloo_n, SkyRL) fine-tune of laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink (Qwen3-8B architecture) on the agentic task set DCAgent/exp_rpt_unitsyn-python-large.

This checkpoint is global_step_10, the highest-EMA saved/aligned checkpoint per the standard trailing-5-step EMA selection rule. Note that the EMA at step 10 is only 0.0098 and the raw reward there is 0.0000.

⚠️ Run status: COLLAPSED / FAILED (reward decayed to zero by step 10 and did not recover)

This run never learned. reward/avg_raw_reward started at ~0.08 at step 1 and decayed to 0.0 by step 10, with policy_entropy collapsing to 0.0 in lockstep. Reward then stayed at ≈ 0 through the final step (17), apart from a brief blip to 0.035 at step 16.

step	avg_raw_reward	pass@8	policy_entropy
1	0.080	0.266	0.0219
4	0.041	0.156	0.0097
8	0.023	0.109	0.0040
10	0.000 (best EMA)	0.000	0.0000
15	0.000	0.000	0.0000
16	0.035	0.156	0.0067
17	0.000 (final)	0.000	0.0000
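
As a reading aid for the table columns, here is a minimal sketch of how avg_raw_reward and pass@8 could be derived from per-rollout rewards. It assumes that pass@8 means "fraction of prompts with at least one successful rollout out of 8" and that rewards are binary; neither definition is confirmed by this card, and `batch_metrics` and the sample data are hypothetical.

```python
def batch_metrics(rewards_per_prompt):
    """Compute (avg_raw_reward, pass@k) for one training step.

    rewards_per_prompt: list of lists, one inner list of k rollout
    rewards per prompt. Assumed definitions, not taken from SkyRL.
    """
    all_rewards = [r for group in rewards_per_prompt for r in group]
    avg_raw_reward = sum(all_rewards) / len(all_rewards)
    # A prompt "passes" if any of its k rollouts earned a positive reward.
    solved = sum(1 for group in rewards_per_prompt if any(r > 0 for r in group))
    pass_at_k = solved / len(rewards_per_prompt)
    return avg_raw_reward, pass_at_k


# Hypothetical batch: 4 prompts x 8 rollouts (illustrative, not logged data).
groups = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0] * 8,
    [0] * 8,
    [1, 1, 0, 0, 0, 0, 0, 0],
]
avg, p8 = batch_metrics(groups)
print(avg, p8)
```

Under these assumptions pass@8 is always at least as large as the average reward, and both are exactly zero once every rollout in the batch fails, which matches the shape of the rows at steps 10, 15, and 17.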

No checkpoint in this run is a usable, reward-positive model. Step 10 is published only to satisfy the standard cleanup-checklist EMA selection (highest EMA among saved steps ≥ 10); its actual reward is zero. The single highest-reward step (step 1, reward 0.08) was not a saved/aligned checkpoint and is below any useful threshold anyway.
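
The selection rule described here (highest EMA among saved steps ≥ 10) can be sketched as follows. This is an illustration under stated assumptions, not the actual cleanup-checklist script: the smoothing factor `alpha = 2 / (span + 1)` for a 5-step span is an assumption, the function names are hypothetical, and the demo rewards are made up rather than taken from the run's logs.

```python
def ema_series(rewards, span=5):
    """Exponential moving average over per-step rewards (index 0 = step 1)."""
    alpha = 2.0 / (span + 1)  # assumed smoothing factor for a 5-step span
    ema, out = None, []
    for r in rewards:
        ema = r if ema is None else alpha * r + (1 - alpha) * ema
        out.append(ema)
    return out


def select_checkpoint(rewards, save_interval=5, min_step=10, span=5):
    """Return (step, ema) for the saved step >= min_step with the highest EMA."""
    ema = ema_series(rewards, span)
    candidates = [
        (step, ema[step - 1])
        for step in range(1, len(rewards) + 1)
        if step % save_interval == 0 and step >= min_step
    ]
    if not candidates:
        return None  # no saved checkpoint satisfies the rule
    return max(candidates, key=lambda c: c[1])


# Made-up decaying rewards for steps 1..17 (NOT the run's logged values).
demo = [0.08 * (0.6 ** i) for i in range(17)]
best_step, best_ema = select_checkpoint(demo)
print(best_step, best_ema)
```

With a monotonically decaying reward like the demo series, the rule picks step 10 over step 15 even though the raw reward at both steps is close to zero, which is the situation this card describes.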

Treat this as a failed / collapsed RL run preserved for the record — NOT a model anyone should deploy or evaluate as a winner.

Training config

Algorithm: GRPO (rloo_n advantage), SkyRL fully-async trainer, FSDP2
Base: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink
Dataset: DCAgent/exp_rpt_unitsyn-python-large
hf_save_interval = 5, max_steps = 80 (run collapsed/exhausted at step 17, the final logged step)
14 nodes × 4 GH200 (Jupiter), TP=1
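
For readers unfamiliar with the advantage setting listed above, here is a minimal sketch of a leave-one-out (RLOO-style) advantage. It assumes that rloo_n refers to baselining each rollout against the mean reward of the other rollouts sampled for the same prompt; SkyRL's actual implementation may differ (for example in normalization), and `rloo_advantages` is a hypothetical name.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for n rollouts of a single prompt.

    Each rollout's baseline is the mean reward of the other n - 1
    rollouts. Assumed formulation, not copied from SkyRL.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


mixed = rloo_advantages([1.0, 0.0, 0.0, 0.0])  # one success out of four
flat = rloo_advantages([0.0] * 8)              # every rollout fails
print(mixed)
print(flat)
```

Under this formulation a group in which every rollout receives the same reward yields all-zero advantages, so a batch with reward 0.0 everywhere provides no learning signal from this term.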

See rl_config.json and training_logs/ (metrics CSVs, reward plot, per-trial results from parse_skyrl_metrics.py) in this repo.

Training Traces

Training-time Daytona/Harbor rollouts for this run are uploaded as a companion dataset: penfever/a3-rl-DCAgent_exp_rpt_unitsyn-python-large

The dataset contains the last episode of each trial (per make_and_upload_trace_dataset --episodes last) — the same rollouts the policy was trained on after rollback / truncation.

Model size: 8B params, tensor type BF16 (Safetensors)

Model tree for laion/a3-rl-DCAgent_exp_rpt_unitsyn-python-large-10-8B

Base model: Qwen/Qwen3-8B-Base
Finetuned: Qwen/Qwen3-8B
Finetuned: laion/GLM-4_7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k-fixthink
Finetuned: this model