MixGRPO:
Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

¹Hunyuan, Tencent
²School of Computer Science, Peking University
³Computer Center, Peking University

Abstract

Although GRPO substantially enhances flow-matching models in human-preference alignment for image generation, methods such as FlowGRPO remain inefficient because they must sample and optimize over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies by integrating stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP, improving efficiency and boosting performance. Specifically, MixGRPO introduces a sliding-window mechanism, applying SDE sampling and GRPO-guided optimization only within the window while using ODE sampling outside it. This design confines sampling randomness to the time-steps within the window, reducing optimization overhead and allowing more focused gradient updates that accelerate convergence. Moreover, because time-steps outside the sliding window are not involved in optimization, higher-order solvers can be used for sampling there. We therefore present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO yields substantial gains across multiple dimensions of human-preference alignment, outperforming DanceGRPO in both effectiveness and efficiency with nearly 50% lower training time. Notably, MixGRPO-Flash reduces training time by a further 71%.
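The sliding-window idea in the abstract can be illustrated with a toy schedule: steps inside the window use stochastic SDE sampling (and receive GRPO gradient updates), while steps outside use deterministic ODE sampling. This is a minimal sketch for intuition only; the function name and window parameters are assumptions, not the paper's actual implementation.

```python
def mixed_sampling_schedule(num_steps: int, window_start: int, window_size: int):
    """Label each denoising step as SDE (inside the sliding window,
    stochastic and GRPO-optimized) or ODE (outside the window,
    deterministic, so higher-order solvers are usable).

    Illustrative toy schedule; not the official MixGRPO code.
    """
    labels = []
    for t in range(num_steps):
        if window_start <= t < window_start + window_size:
            labels.append("SDE")  # randomness confined here; gradients flow here
        else:
            labels.append("ODE")  # no optimization; fast deterministic sampling
    return labels

# Over training, the window slides along the denoising trajectory;
# e.g. a 10-step trajectory with a 3-step window starting at step 2:
print(mixed_sampling_schedule(10, window_start=2, window_size=3))
# -> ['ODE', 'ODE', 'SDE', 'SDE', 'SDE', 'ODE', 'ODE', 'ODE', 'ODE', 'ODE']
```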

🌟 Model

The `diffusion_pytorch_model.safetensors` checkpoint is FLUX.1 Dev post-trained with the MixGRPO algorithm, using HPSv2, ImageReward, and PickScore as the multi-reward objective.

πŸš€ Quick Start (Sample Usage)

Please see the hybrid inference code for mitigating reward hacking on GitHub: https://github.com/Tencent-Hunyuan/MixGRPO?tab=readme-ov-file#run-inference.
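As a starting point, the checkpoint can plausibly be loaded into a standard `diffusers` FLUX pipeline by swapping in the MixGRPO-trained transformer. The repo ID and subfolder layout below are assumptions; refer to the GitHub link above for the official hybrid inference script.

```python
# Hypothetical usage sketch, assuming the checkpoint follows the standard
# diffusers transformer layout. Model IDs are assumptions, not verified.

def build_pipeline(repo_id: str = "tencent/MixGRPO", device: str = "cuda"):
    """Assemble a FLUX.1-dev pipeline whose transformer weights come from
    the MixGRPO checkpoint. Requires `torch` and `diffusers`."""
    import torch
    from diffusers import FluxPipeline, FluxTransformer2DModel

    transformer = FluxTransformer2DModel.from_pretrained(
        repo_id, torch_dtype=torch.bfloat16
    )
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
    return pipe.to(device)


if __name__ == "__main__":
    pipe = build_pipeline()
    image = pipe(
        "a photo of an astronaut riding a horse",
        num_inference_steps=25,
        guidance_scale=3.5,
    ).images[0]
    image.save("mixgrpo_sample.png")
```

For the anti-reward-hacking hybrid inference (mixing the base and MixGRPO models across denoising steps), use the official script linked above rather than this plain pipeline.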

✏️ Citation

License

MixGRPO is licensed under the License Terms of MixGRPO. See ./License.txt for more details.

Bib

If you find MixGRPO useful for your research and applications, please cite using this BibTeX:

@misc{li2025mixgrpounlockingflowbasedgrpo,
      title={MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE}, 
      author={Junzhe Li and Yutao Cui and Tao Huang and Yinping Ma and Chun Fan and Miles Yang and Zhao Zhong},
      year={2025},
      eprint={2507.21802},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.21802}, 
}