Models need feedback on what makes outputs "good" or "bad." Policy optimization (PO) turns preferences and rewards into actual training signals. This field is evolving quickly, moving far beyond classics like PPO and GRPO. So here is our overview of 10 of the newest PO methods:
3. DCPO (Dynamic Clipping Policy Optimization) → DCPO: Dynamic Clipping Policy Optimization (2509.02333) Uses dynamic clipping, which adjusts the probability clipping bounds per token to encourage exploration of low-probability tokens, plus smooth reward standardization to balance rewards across training steps and prevent wasted updates (a rough clipping sketch follows this list)
4. ARPO (Agentic Reinforced Policy Optimization) → Agentic Reinforced Policy Optimization (2507.19849) Optimizes multi-turn LLM agents that use external tools. It uses an entropy-based adaptive rollout to branch exploration right after tool calls and an advantage attribution method to better assign credit across steps, leading to more efficient tool use with fewer rollout resources (a branching sketch follows this list)
5. GRPO-RoC (Group Relative Policy Optimization with Resampling-on-Correct) → rStar2-Agent: Agentic Reasoning Technical Report (2508.20722) Oversamples rollouts, then resamples them to keep diverse mistakes and only the highest-quality correct answers. This reduces noise and yields stronger reasoning in a code environment (a filtering sketch follows this list)
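To make #3 more concrete, here is a minimal PyTorch sketch of per-token dynamic clipping plus running reward standardization. The widening schedule, the `widen` parameter, and the momentum-based statistics are illustrative assumptions, not the exact formulas from the DCPO paper.

```python
# Rough sketch of per-token dynamic clipping + running reward standardization.
# The bound schedule and momentum below are placeholder assumptions, not the
# exact formulas from DCPO (2509.02333).
import torch

def dynamic_clip_loss(logp_new, logp_old, advantages, base_eps=0.2, widen=0.1):
    """PPO-style clipped objective where each token gets its own clip range."""
    ratio = torch.exp(logp_new - logp_old)          # (batch, seq)
    p_old = torch.exp(logp_old)
    # Assumed schedule: low-probability tokens get a wider clip window so they
    # can move more per update, which is the "token exploration" idea.
    eps = base_eps + widen * (1.0 - p_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def smooth_standardize(rewards, stats, momentum=0.99):
    """Standardize rewards with running statistics accumulated over training
    steps (a stand-in for DCPO's smooth reward standardization)."""
    mean, std = stats
    mean = momentum * mean + (1 - momentum) * rewards.mean()
    std = momentum * std + (1 - momentum) * rewards.std()
    return (rewards - mean) / (std + 1e-8), (mean, std)
```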
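For #4, the core trick is deciding where to spend extra rollouts. The sketch below gates branching on an entropy jump right after a tool call; the threshold `delta`, branching factor `k_extra`, budget, and the idea of precomputed per-step entropies are assumptions for illustration, not ARPO's exact mechanism.

```python
# Sketch of entropy-gated branching after tool calls (illustrative only).
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def plan_branches(step_entropies, post_tool_steps, base_entropy,
                  delta=0.5, k_extra=2, budget=8):
    """Pick the steps (right after tool responses) where extra rollouts should
    be forked: only where entropy jumps above the pre-tool baseline, and only
    while the extra-rollout budget lasts."""
    branch_points = []
    for t in post_tool_steps:
        if len(branch_points) * k_extra >= budget:
            break                                   # out of rollout budget
        if step_entropies[t] > base_entropy + delta:
            branch_points.append(t)                 # uncertain -> worth exploring
    return branch_points
```

The forked branches share a prefix but diverge afterwards, so credit can then be assigned at the step level rather than per whole trajectory, which is the role ARPO's advantage attribution plays.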
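And for #5, a toy version of the resample-on-correct filter. The quality score used to rank correct traces (e.g. penalizing tool errors or sloppy formatting) and the half-and-half split are assumptions standing in for the criteria in the rStar2-Agent report.

```python
# Toy resample-on-correct filter over an oversampled rollout group.
import random

def resample_on_correct(rollouts, group_size, rng=random):
    """rollouts: list of dicts with 'correct' (bool) and 'quality' (float).
    Keep only the cleanest correct traces plus a random (diverse) subset of
    failures, then train on the reduced group of `group_size` rollouts."""
    correct = [r for r in rollouts if r["correct"]]
    incorrect = [r for r in rollouts if not r["correct"]]

    n_correct = min(len(correct), group_size // 2)
    n_incorrect = min(len(incorrect), group_size - n_correct)

    # Correct answers: keep only the highest-quality ones.
    best_correct = sorted(correct, key=lambda r: r["quality"], reverse=True)[:n_correct]
    # Failures: keep a random subset so diverse mistakes stay in the group.
    kept_failures = rng.sample(incorrect, n_incorrect)
    return best_correct + kept_failures
```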