Paper Index

Section under construction. Feel free to contribute!

Group Sequence Policy Optimization

📜 Paper: https://huggingface.co/papers/2507.18071

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of the token level. To reproduce the paper’s setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)
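To make the difference between the two importance sampling levels concrete, here is a minimal sketch in plain Python. It is not TRL's implementation: the helper names and the log-probability values are made up for illustration, and the sequence-level weight is assumed to be the length-normalized ratio, i.e. the exponential of the mean per-token log-ratio.

```python
import math

def token_level_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new / pi_old (hypothetical helper)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_level_ratio(new_logps, old_logps):
    """Length-normalized sequence ratio: exp(mean of per-token log-ratios).

    Hypothetical helper; assumes the sequence weight is shared by all tokens.
    """
    log_ratios = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(log_ratios) / len(log_ratios))

# Made-up per-token log-probabilities for one 3-token completion.
old_logps = [-1.0, -2.0, -0.5]
new_logps = [-0.9, -2.3, -0.5]

print(token_level_ratios(new_logps, old_logps))    # one weight per token
print(sequence_level_ratio(new_logps, old_logps))  # one weight for the sequence
```

With token-level weights each token is reweighted on its own, whereas the sequence-level weight is a single value applied to every token of the completion.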

The original paper doesn’t specify the hyperparameters used. Note that sequence-level importance sampling only has an effect when training is slightly off-policy, for example when steps_per_generation > gradient_accumulation_steps or num_iterations > 1. Otherwise, it is effectively equivalent to no modification.
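The off-policy condition stated above can be written as a small check. This is only a sketch: `is_off_policy` is a hypothetical helper and not part of TRL, and its arguments simply mirror the configuration fields named in the text.

```python
def is_off_policy(steps_per_generation, gradient_accumulation_steps, num_iterations):
    """Return True when the setup reuses generations for more than one update.

    Hypothetical helper encoding the condition from the text:
    steps_per_generation > gradient_accumulation_steps or num_iterations > 1.
    """
    return (
        steps_per_generation > gradient_accumulation_steps
        or num_iterations > 1
    )

# Each batch of generations is spread over several optimizer steps.
print(is_off_policy(steps_per_generation=8, gradient_accumulation_steps=2, num_iterations=1))

# One optimizer step per batch of generations, a single pass over it.
print(is_off_policy(steps_per_generation=2, gradient_accumulation_steps=2, num_iterations=1))
```

When the check returns False, switching importance_sampling_level to "sequence" should not change training in any meaningful way.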
