
Goldilocks RL: Calibrating Question Difficulty to Escape Sparse Rewards

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning abilities in large language models. However, its reliance on sparse rewards makes the technique sample-inefficient: models must navigate large search spaces with little feedback. Although classical curriculum learning aims to mitigate this by ordering training data from easy to hard, the appropriate ordering for a particular model is often unclear. To address this, we propose Goldilocks, a teacher-driven data sampling strategy in which a teacher model predicts the difficulty of each question for a student model. While the student is trained with GRPO, the teacher selects questions of appropriate difficulty, that is, questions that are neither too easy nor too hard (the Goldilocks principle). By observing the student's performance on the sampled questions, the teacher continually adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling outperforms models trained with conventional GRPO under the same computational budget.
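The core selection step described above can be sketched in a few lines. The abstract does not specify the implementation, so the function names (`goldilocks_sample`, `predict_success`) and the success-rate band `[0.25, 0.75]` below are illustrative assumptions, not the paper's actual method: a teacher scores each candidate question by the student's predicted probability of solving it, and only questions in the "just right" band are sampled for the GRPO batch.

```python
import random

def goldilocks_sample(questions, predict_success, batch_size,
                      low=0.25, high=0.75, rng=random):
    """Sample a training batch of questions whose predicted student
    success rate lies in the 'just right' band (neither too easy nor
    too hard). `predict_success` stands in for the teacher model;
    the band [low, high] is an assumed hyperparameter.
    """
    # Keep only questions the teacher deems appropriately difficult.
    band = [q for q in questions if low <= predict_success(q) <= high]
    # Fall back to the full pool if the band is smaller than a batch.
    pool = band if len(band) >= batch_size else questions
    return rng.sample(pool, min(batch_size, len(pool)))

# Toy usage: question ids 0..99, where the predicted success rate of
# question q is simply q / 100 (a stand-in for a real teacher model).
rng = random.Random(0)
batch = goldilocks_sample(range(100), lambda q: q / 100,
                          batch_size=8, rng=rng)
```

In a full training loop, the teacher's predictor would be refreshed from the student's observed pass rates on recent batches, which is how it tracks the student's evolving abilities.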
