rotem israeli (irotem98)
7 followers · 10 following
https://rotem154154.github.io
rotem154154
AI & ML interests
None yet
Recent Activity
reacted to Kseniase's post with 👍 · 21 minutes ago
9 new policy optimization techniques

Reinforcement Learning (RL) won't stay stuck in the same old PPO loop: in the last two months alone, researchers have introduced a new wave of techniques, reshaping how we train and fine-tune LLMs, VLMs, and agents. Here are 9 fresh policy optimization techniques worth knowing:

1. GSPO: Group Sequence Policy Optimization → https://huggingface.co/papers/2507.18071
Shifts optimization, clipping, and rewarding from the token level to the sequence level to capture the full picture and improve stability compared to GRPO. A GSPO-token variant also allows token-level fine-tuning (see the sketch after this list).

2. LAPO: Length-Adaptive Policy Optimization → https://huggingface.co/papers/2507.15758
A two-stage RL framework that trains models to adaptively control reasoning length by learning typical solution lengths, yielding shorter and more efficient reasoning.

3. HBPO: Hierarchical Budget Policy Optimization → https://huggingface.co/papers/2507.15844
This one trains models to adapt reasoning depth to problem complexity. It divides training samples into subgroups with different token budgets, using budget-aware rewards to align reasoning effort with task difficulty.

4. SOPHIA: Semi-off-policy reinforcement learning → https://huggingface.co/papers/2507.16814
Combines on-policy visual understanding from Vision Language Models (VLMs) with off-policy reasoning from an LM, assigning outcome-based rewards and propagating visual rewards backward through the reasoning steps.

5. RePO: Replay-Enhanced Policy Optimization → https://huggingface.co/papers/2506.09340
Introduces a replay buffer into on-policy RL for LLMs, retrieving diverse off-policy samples for each prompt to broaden the training data per prompt.

Read further below ⬇️

If you like it, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
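To make the GSPO idea concrete, here is a minimal sketch of its sequence-level objective, assuming GRPO-style group advantages \(\hat{A}_i\) over a group of \(G\) sampled responses \(y_i\) and a clipping range \(\varepsilon\) (notation mine, following the paper's description of a length-normalized sequence importance ratio):

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|},
\qquad
J_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\Big(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right]
$$

Because \(s_i\) is computed over the whole sequence \(y_i\), clipping keeps or drops entire responses rather than individual tokens, which is the source of the stability gain over GRPO described above.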
upvoted a paper · about 6 hours ago
Group Sequence Policy Optimization
liked a model · about 21 hours ago
Qwen/Qwen3-4B
Organizations
None yet
spaces (1)
🐨 Edge Vlm · Runtime error
models (1)
irotem98/emu3_test_smollm_360m_2 · Updated Oct 8, 2024 · 65
datasets (37) · sorted by recently updated
irotem98/htmls10k · Preview · Updated Jun 3 · 6 · 1
irotem98/htmls1k · Updated May 29 · 18
irotem98/CodePen-25k · Updated May 27 · 12
irotem98/mc_html_screenshot · Viewer · Updated May 20 · 1.64k · 106
irotem98/imagenet_3gb · Updated Mar 18 · 9
irotem98/sfhq_encoded · Updated Mar 17 · 15
irotem98/counter_strike_encoded_frames · Updated Nov 26, 2024 · 5
irotem98/minerl_encoded · Updated Nov 17, 2024 · 5
irotem98/minerl · Updated Nov 15, 2024 · 4
irotem98/minerl_videos · Updated Nov 15, 2024 · 3