TRL documentation
Paper Index
Section under construction. Feel free to contribute!
Group Sequence Policy Optimization
📜 Paper: https://huggingface.co/papers/2507.18071
GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token. To reproduce the paper’s setting, use this configuration:
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)
While the original paper doesn’t specify the hyperparameters used, this modification only has an effect when training is slightly off-policy, for example when steps_per_generation > gradient_accumulation_steps or num_iterations > 1. Otherwise, it is effectively equivalent to no modification.
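To make the difference between token-level and sequence-level importance weights concrete, here is a minimal sketch in plain Python. It is not TRL's implementation: the helper names (token_level_ratios, sequence_level_ratio, is_off_policy) and the log-probability values are hypothetical, chosen only for illustration. The sequence-level weight is assumed to be the length-normalized ratio described in the GSPO paper, and is_off_policy simply encodes the condition stated above.

```python
import math


def token_level_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new(y_t) / pi_old(y_t), as in GRPO."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_level_ratio(new_logps, old_logps):
    """Length-normalized sequence-level ratio (assumed GSPO form):
    exp(mean_t(log pi_new(y_t) - log pi_old(y_t)))."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def is_off_policy(steps_per_generation, gradient_accumulation_steps, num_iterations):
    """Hypothetical helper encoding the condition stated in the text above."""
    return steps_per_generation > gradient_accumulation_steps or num_iterations > 1


# Illustrative per-token log-probabilities for one 3-token completion.
old_logps = [-1.0, -2.0, -0.5]
new_logps = [-0.9, -2.3, -0.5]

# Off-policy: the token-level weights differ per token,
# while the sequence-level weight is a single scalar for the completion.
print(token_level_ratios(new_logps, old_logps))
print(sequence_level_ratio(new_logps, old_logps))

# On-policy: the policy has not changed, so both kinds of weight equal 1
# and the choice of importance sampling level makes no difference.
print(token_level_ratios(old_logps, old_logps))
print(sequence_level_ratio(old_logps, old_logps))
```

When the new and old log-probabilities are identical, every weight is exactly 1, which is why the sequence-level setting only changes the loss once training has drifted off-policy.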