Kseniase posted an update 10 days ago
9 new policy optimization techniques

Reinforcement Learning (RL) is no longer stuck in the same old PPO loop: in the last two months alone, researchers have introduced a new wave of techniques that is reshaping how we train and fine-tune LLMs, VLMs, and agents.

Here are 9 fresh policy optimization techniques worth knowing:

1. GSPO: Group Sequence Policy Optimization → Group Sequence Policy Optimization (2507.18071)
Shifts optimization, clipping, and rewarding from the token level to the sequence level, capturing the full picture and improving stability compared to GRPO. A GSPO-token variant still allows token-level fine-tuning.
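
A minimal PyTorch-style sketch of the sequence-level objective (my own illustration, not the authors' code), assuming you already have per-token log-probs under the current and old policies plus one scalar reward per response in the group:

```python
# Hedged sketch of a GSPO-style sequence-level loss (illustrative, not official code).
import torch

def gspo_loss(logp_new, logp_old, rewards, mask, eps=0.2):
    """logp_new, logp_old, mask: [G, T] per-token tensors for a group of G responses.
    rewards: [G] scalar rewards. Returns a scalar loss to minimize."""
    lengths = mask.sum(dim=-1).clamp(min=1)
    # Sequence-level, length-normalized importance ratio (geometric mean of token ratios).
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    # Group-relative advantage, as in GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Clipping happens at the sequence level, not per token.
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```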

2. LAPO: Length-Adaptive Policy Optimization → LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization (2507.15758)
A two-stage RL framework that trains models to adaptively control reasoning length: they first learn the typical lengths of successful solutions, then internalize them for shorter, more efficient reasoning.
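
A rough sketch of the length-adaptive reward-shaping idea (the reward form and the `alpha` weight below are my own toy choices, not the paper's formulas):

```python
# Hedged sketch in the spirit of LAPO: stage 1 collects length statistics of
# successful rollouts, stage 2 rewards staying near the learned typical length.
import statistics

def target_length_from_rollouts(rollouts):
    """rollouts: list of (num_tokens, is_correct). Returns a typical correct length."""
    correct_lengths = [n for n, ok in rollouts if ok]
    return statistics.median(correct_lengths) if correct_lengths else None

def length_adaptive_reward(is_correct, num_tokens, target_len, alpha=0.3):
    """Correctness reward plus a bonus that decays as the response drifts
    from the learned typical length (only granted when the answer is right)."""
    if not is_correct or target_len is None:
        return 1.0 if is_correct else 0.0
    length_bonus = max(0.0, 1.0 - abs(num_tokens - target_len) / target_len)
    return 1.0 + alpha * length_bonus
```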

3. HBPO: Hierarchical Budget Policy Optimization → Hierarchical Budget Policy Optimization for Adaptive Reasoning (2507.15844)
This one trains models to adapt reasoning depth based on problem complexity. It partitions training samples into subgroups with different token budgets and uses budget-aware rewards to align reasoning effort with task difficulty.
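
A toy sketch of budget-aware rewards (the budget list, penalty shape, and values are placeholders of mine; the paper's exact subgrouping and reward differ):

```python
# Hedged sketch of budget-aware reward assignment in the spirit of HBPO.
def budget_aware_reward(is_correct, num_tokens, budget, overrun_penalty=0.5):
    """Reward correctness, and penalize spending tokens past the subgroup's budget."""
    reward = 1.0 if is_correct else 0.0
    if num_tokens > budget:
        reward -= overrun_penalty * (num_tokens - budget) / budget
    return reward

# Example: rollouts sampled under different budget subgroups for one prompt.
budgets = [512, 1024, 2048, 4096]
rollouts = [(True, 700), (True, 400), (False, 3000), (True, 2500)]
rewards = [budget_aware_reward(ok, n, b) for (ok, n), b in zip(rollouts, budgets)]
```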

4. SOPHIA: Semi-off-policy reinforcement learning → Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning (2507.16814)
Combines on-policy visual understanding from a Vision-Language Model (VLM) with off-policy reasoning from a language model, assigning outcome-based rewards and propagating visual rewards backward through the reasoning steps.
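
A very simplified sketch of how backward reward propagation over off-policy reasoning steps could look (the decay scheme and function below are my own illustration, not SOPHIA's exact mechanism):

```python
# Hedged sketch: the VLM describes the image on-policy, an external LM produces the
# reasoning steps off-policy, and the outcome reward is propagated backward
# through those steps with a decay factor (toy scheme of mine).
def propagate_outcome_reward(num_steps, outcome_reward, decay=0.9):
    """Return one reward per reasoning step, with later steps credited more."""
    return [outcome_reward * (decay ** (num_steps - 1 - t)) for t in range(num_steps)]

# Example: a 4-step off-policy reasoning trace whose final answer was correct.
step_rewards = propagate_outcome_reward(num_steps=4, outcome_reward=1.0)
# Later steps get more credit: roughly [0.73, 0.81, 0.9, 1.0]
```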

5. RePO: Replay-Enhanced Policy Optimization → RePO: Replay-Enhanced Policy Optimization (2506.09340)
Introduces a replay buffer into on-policy RL for LLMs, retrieving diverse off-policy samples to broaden the training data for each prompt.
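
A minimal sketch of the per-prompt replay idea (buffer size, retrieval rule, and class names are my own placeholders):

```python
# Hedged sketch of mixing fresh on-policy rollouts with replayed off-policy ones,
# in the spirit of RePO (simplified buffer and retrieval logic).
import random
from collections import defaultdict, deque

class ReplayBuffer:
    def __init__(self, per_prompt=64):
        self.store = defaultdict(lambda: deque(maxlen=per_prompt))

    def add(self, prompt, rollout):
        self.store[prompt].append(rollout)

    def sample(self, prompt, k):
        past = list(self.store[prompt])
        return random.sample(past, min(k, len(past)))

def build_group(prompt, fresh_rollouts, buffer, num_replay=4):
    """Broaden the per-prompt group with diverse off-policy samples from the buffer."""
    group = list(fresh_rollouts) + buffer.sample(prompt, num_replay)
    for r in fresh_rollouts:
        buffer.add(prompt, r)
    return group
```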

Read further below ⬇️
If you like it, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
6. CISPO: Clipped Importance Sampling Policy Optimization → https://huggingface.co/papers/2506.13585
This RL algorithm from the MiniMax-M1 project clips importance-sampling weights instead of per-token updates, so all tokens (even rare but crucial ones) contribute to learning rather than being dropped by token-level clipping. CISPO also avoids KL penalties and uses group-relative advantages like GRPO.
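
A PyTorch-style sketch of a CISPO-like token loss (illustrative only; the clipping hyperparameters are placeholders, not MiniMax's settings):

```python
# Hedged sketch: the importance-sampling weight is clipped and detached, so every
# token still contributes a policy-gradient term and no token update is dropped.
import torch

def cispo_loss(logp_new, logp_old, advantages, mask, eps_low=0.2, eps_high=0.2):
    """logp_new, logp_old, mask: [G, T]; advantages: [G] group-relative advantages.
    eps_low/eps_high are placeholder values for the clipping range."""
    ratio = (logp_new - logp_old).exp()
    # Clip the IS weight itself and stop its gradient; keep the grad through logp_new.
    weight = ratio.clamp(1 - eps_low, 1 + eps_high).detach()
    per_token = weight * advantages.unsqueeze(-1) * logp_new
    return -(per_token * mask).sum() / mask.sum().clamp(min=1)
```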

7. PAPO: Perception-Aware Policy Optimization → https://huggingface.co/papers/2507.06448
Enhances RL in vision-language tasks by adding a KL-based perception loss to the GRPO objective for better visual alignment during training. It boosts accuracy by 4–8% and reduces perception errors by ~30%.
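
A sketch of how a KL-based perception term could be attached to a GRPO loss, based on my reading that the policy's token distribution is compared between the original and a masked image (function names and the `gamma` coefficient are assumptions):

```python
# Hedged sketch of a PAPO-style perception term: encourage the token distribution
# to differ between the real image and a masked one, so answers depend on the pixels.
import torch
import torch.nn.functional as F

def perception_kl(logits_real_img, logits_masked_img, mask):
    """Per-token KL( p(real image) || p(masked image) ), averaged over valid tokens."""
    logp_real = F.log_softmax(logits_real_img, dim=-1)
    logp_masked = F.log_softmax(logits_masked_img, dim=-1)
    kl = (logp_real.exp() * (logp_real - logp_masked)).sum(dim=-1)  # [G, T]
    return (kl * mask).sum() / mask.sum().clamp(min=1)

def papo_loss(grpo_loss, logits_real_img, logits_masked_img, mask, gamma=0.01):
    # gamma is a placeholder coefficient; subtracting the term rewards divergence.
    return grpo_loss - gamma * perception_kl(logits_real_img, logits_masked_img, mask)
```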

8. OPO: On-Policy RL with Optimal Baseline → https://huggingface.co/papers/2505.23585
A simplified RL algorithm from Microsoft that enforces strict on-policy training, using freshly sampled outputs from the current policy for every update to avoid off-policy drift. An optimal reward baseline keeps gradient variance low without auxiliary models or extra regularization.
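
A sketch of an OPO-style update with a length-weighted reward baseline (my paraphrase of the optimal-baseline idea; function names are mine):

```python
# Hedged sketch: plain policy gradient on freshly sampled on-policy outputs, with
# the baseline formed by length-weighting each sampled response's reward.
import torch

def opo_advantages(rewards, lengths):
    """rewards, lengths: [G] tensors for G fresh on-policy samples of one prompt."""
    baseline = (lengths * rewards).sum() / lengths.sum().clamp(min=1)
    return rewards - baseline

def opo_loss(logp_new, rewards, lengths, mask):
    """No clipping, no KL term, no auxiliary value model in this sketch."""
    adv = opo_advantages(rewards, lengths)
    per_token = adv.unsqueeze(-1) * logp_new
    return -(per_token * mask).sum() / mask.sum().clamp(min=1)
```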

9. EXPO: Expressive Policy Optimization → https://huggingface.co/papers/2507.07986
Trains complex, expressive policies by pairing a large base policy with a lightweight edit policy that suggests refined actions, then picking the better of the two without backpropagating through the base model.
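
A sketch of the base-plus-edit action selection (shapes, names, and the residual-edit form are my assumptions; `base_policy`, `edit_policy`, and `q_fn` stand in for whatever models you actually use):

```python
# Hedged sketch: a frozen expressive base policy proposes an action, a small edit
# policy proposes a refined one, and a Q-function picks whichever scores higher.
# Only the edit policy and Q-function are trained; no gradients flow into the base.
import torch

@torch.no_grad()
def base_action(base_policy, obs):
    return base_policy(obs)  # frozen, possibly a large generative policy

def select_action(base_policy, edit_policy, q_fn, obs):
    a_base = base_action(base_policy, obs)
    a_edit = a_base + edit_policy(obs, a_base)      # lightweight residual edit
    q_pair = torch.stack([q_fn(obs, a_base), q_fn(obs, a_edit)], dim=0)
    pick = q_pair.argmax(dim=0)                     # best of the two per sample
    return torch.where(pick.unsqueeze(-1).bool(), a_edit, a_base)
```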
