eoe

AI & ML interests

None yet

Recent Activity

reacted to anakin87's post with ❤️ 1 day ago
How does LLM training with RL environments work?

It all starts with **Reinforcement Learning with Verifiable Rewards**:
- a question is asked
- the model generates reasoning + an answer
- the answer is checked against the ground truth
- the reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, and reward(s).

Now consider a more complex tic-tac-toe environment ❌⭕. It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (environments can also include tools)

What happens at training time? We use **Group Relative Policy Optimization** (GRPO) with the tic-tac-toe environment. No critic model is needed: the group itself serves as the baseline, which makes GRPO simpler than PPO.

1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game is scored with deterministic rewards (win, format, ...)
3️⃣ The mean score is computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ The model is updated to favor trajectories above the baseline
🔁 Repeat

For a deep dive, check out 🌱 https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs.
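The group-relative advantage step above can be sketched in a few lines. This is an illustrative toy, not the course's implementation: `group_advantages` and the example reward values are made up here, and real GRPO implementations often also normalize by the group's standard deviation, which is omitted.

```python
def group_advantages(rewards):
    """GRPO-style advantages: each rollout's reward minus the group mean.

    The group mean acts as the baseline, so no critic model is needed.
    (Many implementations also divide by the group's std; omitted here.)
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example: 4 rollouts from the same board, scored with a deterministic
# reward (1.0 = win, 0.0 = draw, -1.0 = loss).
rewards = [1.0, -1.0, 1.0, 0.0]
print(group_advantages(rewards))  # → [0.75, -1.25, 0.75, -0.25]
```

Rollouts scoring above the group mean get a positive advantage and are reinforced; those below the mean are pushed down.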

Organizations

None yet