
Paper Index

Section under construction. Feel free to contribute!

Group Sequence Policy Optimization

📜 Paper: https://huggingface.co/papers/2507.18071

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per token. To reproduce the paper's setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",  # GSPO: weight whole sequences rather than individual tokens
    loss_type="grpo",
    steps_per_generation=...,  # placeholder, set to match your generation schedule
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # clipping range, see https://x.com/ChujieZheng/status/1948933507696525392
)
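
For intuition, the sketch below (illustrative only, not TRL's internal implementation) contrasts the two levels of importance weighting. It assumes per-token log-probabilities of shape (batch, seq_len) under the current and old policies, plus a float mask marking completion tokens:

import torch

def importance_weights(logps_new, logps_old, mask, level="token"):
    # Per-token log importance ratios, zeroed out on padding positions.
    log_ratio = (logps_new - logps_old) * mask
    if level == "token":
        # GRPO: one importance weight per token.
        return torch.exp(log_ratio)
    # GSPO ("sequence"): a single weight per sequence, the
    # length-normalized (geometric-mean) ratio over completion tokens.
    seq_log_ratio = log_ratio.sum(dim=-1) / mask.sum(dim=-1)
    return torch.exp(seq_log_ratio)

Normalizing by the sequence length keeps the ratio on a comparable scale across completions of different lengths, which is consistent with the much smaller clipping epsilon used above.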

The original paper doesn't specify the values of the hyperparameters in the configuration above. Note that sequence-level importance sampling only has an effect when training is slightly off-policy, for example when steps_per_generation > gradient_accumulation_steps or when num_iterations > 1. Otherwise, the importance weights are all equal to 1 and the modification is effectively a no-op.
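
As a concrete illustration (the values here are hypothetical, not taken from the paper), reusing each batch of generations for several optimization steps makes the later steps slightly off-policy, so the sequence-level weights deviate from 1:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    steps_per_generation=4,         # hypothetical: 4 optimization steps per generation round
    gradient_accumulation_steps=1,  # 4 > 1, so steps after the first are off-policy
)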
