Article: Visualize and understand GPU memory in PyTorch By qgallouedec • Dec 24, 2024 • 243
Article: Simplifying Alignment: From RLHF to Direct Preference Optimization (DPO) By ariG23498 • Jan 19 • 28
Reply: Hi there. I think there is an error in your PPO description: PPO does not explicitly penalize the KL divergence from the initial (reference) policy.
Article: DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge By NormalUhr • Feb 7 • 211