arxiv:2506.09340

RePO: Replay-Enhanced Policy Optimization

Published on Jun 11, 2025

AI-generated summary

Replay-Enhanced Policy Optimization (RePO) improves reinforcement learning for large language models by optimizing the policy over diverse off-policy samples drawn from a replay buffer, improving both performance and optimization efficiency.

Abstract

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). The recent Group Relative Policy Optimization (GRPO) estimates advantages from multiple on-policy outputs per prompt, which incurs high computational cost and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization over a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks show that RePO achieves absolute average performance gains of 18.4 and 4.1 points over GRPO for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively. Further analysis indicates that, with both the on-policy and off-policy sample counts set to 8, RePO increases computational cost by only 15% while raising the number of effective optimization steps by 48% for Qwen3-1.7B. The code is available at https://github.com/SihengLi99/RePO.
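
Illustrative sketch (not from the paper)

The abstract describes RePO as widening GRPO's per-prompt group of on-policy rollouts with off-policy samples retrieved from a replay buffer, then computing group-relative advantages over the mixed set. The Python sketch below is a hypothetical illustration of that idea under simple assumptions: the ReplayBuffer class, the uniform-sampling replay strategy, and all function names are inventions for exposition, not the authors' implementation (see the linked repository for the actual code).

import random
from collections import defaultdict

import numpy as np


class ReplayBuffer:
    """Hypothetical per-prompt buffer of past (completion, reward) pairs."""

    def __init__(self, capacity_per_prompt=64):
        self.capacity = capacity_per_prompt
        self.store = defaultdict(list)

    def add(self, prompt, completion, reward):
        bucket = self.store[prompt]
        bucket.append((completion, reward))
        if len(bucket) > self.capacity:
            bucket.pop(0)  # drop the oldest entry once the buffer is full

    def sample(self, prompt, k):
        # One possible replay strategy: uniform sampling of stored completions.
        bucket = self.store[prompt]
        return random.sample(bucket, min(k, len(bucket)))


def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style advantage: standardize each reward within its per-prompt group.
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)


def build_training_group(prompt, on_policy_rollouts, buffer, n_off_policy=8):
    """Mix fresh on-policy rollouts with replayed off-policy samples.

    on_policy_rollouts is a list of (completion, reward) pairs produced by the
    current policy; the replayed samples broaden the group, which is the idea
    the abstract describes (other replay strategies could be substituted here).
    """
    replayed = buffer.sample(prompt, n_off_policy)
    group = list(on_policy_rollouts) + replayed
    advantages = group_relative_advantages([reward for _, reward in group])
    # A real implementation would also apply importance weighting / clipping
    # to the off-policy samples in the policy-gradient loss; omitted here.
    return group, advantages

In this reading, each training step would generate the on-policy rollouts (8 in the abstract's reported setting), call build_training_group to append up to 8 replayed completions, feed the mixed group and its advantages to the policy update, and then add the fresh rollouts to the buffer for later steps.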
