On the Hidden Objective Biases of Group-based Reinforcement Learning
Group-based reinforcement learning methods, like Group Relative Policy Optimization (GRPO), are widely used nowadays to post-train large language models. Despite their empirical …
aleksandar-fontana