Papers
arxiv:2507.18071

Group Sequence Policy Optimization

Published on Jul 24
· Submitted by chujiezheng on Jul 25
#1 Paper of the day

Abstract

Group Sequence Policy Optimization (GSPO) is a reinforcement learning algorithm that improves the training efficiency and performance of large language models by defining importance ratios and performing clipping, rewarding, and optimization at the sequence level.

AI-generated summary

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
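
For a concrete picture of what sequence-level importance ratios and clipping might look like, here is a minimal PyTorch sketch of a GSPO-style objective. It is illustrative only and not taken from the paper: the function name, tensor shapes, and the choice to length-normalize the sequence log-likelihood ratio are assumptions.

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, seq_mask, eps=0.2):
    """Hedged sketch of a sequence-level clipped policy-gradient loss.

    Assumed shapes (not specified on this page):
      logp_new, logp_old: (G, T) per-token log-probs under current / old policy
      seq_mask:           (G, T) 1.0 for response tokens, 0.0 for padding
      advantages:         (G,)   group-relative advantage, one scalar per response
    """
    lengths = seq_mask.sum(dim=-1).clamp(min=1)
    # Sequence log-likelihood ratio, normalized by response length (assumption).
    log_ratio = ((logp_new - logp_old) * seq_mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()  # one importance ratio per sequence, not per token
    # PPO-style clipping applied at the sequence level: the whole response is
    # clipped or kept as a unit, rather than token by token.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```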

Community

Paper author · Paper submitter

This paper introduces Group Sequence Policy Optimization (GSPO), a stable, efficient, and performant RL algorithm for training the latest Qwen3 models (Instruct, Coder, and Thinking).

beautiful 🔥

Any open source implementation?

To address the high variance issue in token-level importance sampling and the information loss in GSPO's sequence-level approach, I propose a Subsequence-level Clipped Importance Sampling method. For a sequence
a = (a_1, \dots, a_T)
split into K subsequences, compute weights as:
\rho_{\text{sub}, k} = \text{clip}\left( \frac{\pi_{\theta}(a_{\text{sub}, k} \mid s)}{\pi_{\theta_{\text{old}}}(a_{\text{sub}, k} \mid s)}, 1-\epsilon, 1+\epsilon \right), \quad \rho = \prod_{k=1}^K \rho_{\text{sub}, k}
Add a trust region constraint:
\mathbb{E}_s \left[ D_{\text{KL}}\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_{\theta}\right) \right] \leq \delta
This reduces variance by bounding each factor of the product, retains local information via subsequence granularity, and keeps updates stable through clipping and the KL constraint; it may offer more flexibility and efficiency than GSPO.
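
A rough PyTorch sketch of this proposal is below, assuming subsequences are contiguous, roughly equal-length chunks of the response (the comment does not say how they are chosen); the function name and shapes are hypothetical.

```python
import torch

def subsequence_clipped_ratio(logp_new, logp_old, seq_mask, num_chunks=4, eps=0.2):
    """Hypothetical subsequence-level clipped importance weight rho.

    logp_new, logp_old, seq_mask: (G, T) tensors; returns one (G,) weight per response.
    """
    log_diff = (logp_new - logp_old) * seq_mask  # per-token log-ratio, padding zeroed out
    rho = torch.ones(log_diff.shape[0], device=log_diff.device)
    # Split each response into K contiguous chunks, clip each chunk's ratio,
    # and multiply the clipped factors into a single sequence weight.
    for chunk in torch.chunk(log_diff, num_chunks, dim=-1):
        sub_ratio = chunk.sum(dim=-1).exp()
        rho = rho * torch.clamp(sub_ratio, 1.0 - eps, 1.0 + eps)
    return rho
```

The resulting rho would multiply a group-relative advantage in the surrogate loss, with the proposed KL trust-region constraint enforced separately (e.g., as a penalty term or an early-stopping check).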

Awesome work!!

Thanks for the awesome work! May I ask if you have plans to release the verl code for GSPO?

Thank you so much for the prompt response! It helped a lot!

Models citing this paper 3

Datasets citing this paper 0

Spaces citing this paper 0

Collections including this paper 34