Group-Specific Relative Policy Optimization for Heterogeneous Preference Alignment

Despite their broad general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse preferences because standard post-training methods, such as Reinforcement Learning from Human Feedback (RLHF), optimize for a single, global objective. Group Relative Policy Optimization (GRPO) is a widely used framework for policy optimization, but its group-based advantage computation implicitly assumes that all samples are interchangeable, so it inherits this limitation in personalized settings. The assumption breaks down when user reward distributions differ: optimization is systematically biased toward dominant preferences while minority preference signals are suppressed. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from the statistics of the current batch. By normalizing advantages against group-specific reward histories rather than batch-level statistics, P-GRPO preserves the differential signal needed to learn distinct preferences. We evaluate P-GRPO across a variety of tasks and find that it achieves faster convergence and higher rewards than conventional GRPO, improving its ability to recover and align with diverse preference signals. Our results show that accounting for heterogeneous rewards at the optimization level is important for building models that reliably match users' differing preferences without sacrificing general capabilities.
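
As a concrete illustration of the mechanism the abstract describes, here is a minimal Python sketch of group-specific advantage normalization: advantages are computed against a running estimate of each group's own reward history instead of the current batch's statistics. The class name, the exponential-moving-average update, and all hyperparameters are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np
from collections import defaultdict


class GroupRewardTracker:
    """Running per-group reward statistics (illustrative helper).

    Standard GRPO normalizes each reward against the statistics of the
    current batch of sampled completions. This sketch instead maintains
    an exponential moving average of the mean and variance of each user
    group's rewards, so advantages are measured against that group's
    own reward history rather than a shared batch baseline.
    """

    def __init__(self, momentum: float = 0.99, eps: float = 1e-8):
        self.momentum = momentum
        self.eps = eps
        self.mean = defaultdict(float)       # per-group running mean
        self.var = defaultdict(lambda: 1.0)  # per-group running variance
        self._seen = set()

    def update(self, group_id: str, rewards: np.ndarray) -> None:
        """Fold one batch of rewards for a group into its running stats."""
        batch_mean = float(rewards.mean())
        batch_var = float(rewards.var())
        if group_id not in self._seen:
            self.mean[group_id] = batch_mean
            self.var[group_id] = max(batch_var, self.eps)
            self._seen.add(group_id)
        else:
            m = self.momentum
            self.mean[group_id] = m * self.mean[group_id] + (1 - m) * batch_mean
            self.var[group_id] = m * self.var[group_id] + (1 - m) * batch_var

    def advantages(self, group_id: str, rewards: np.ndarray) -> np.ndarray:
        """Normalize rewards against the group's historical mean/std."""
        std = np.sqrt(self.var[group_id]) + self.eps
        return (rewards - self.mean[group_id]) / std


def batch_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Plain GRPO-style baseline for comparison: normalize within the batch."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tracker = GroupRewardTracker()
    # Two user groups whose reward scales differ by 5x. Under a shared
    # batch baseline, group "b" would contribute a much weaker signal.
    for step in range(3):
        for gid, scale in [("a", 5.0), ("b", 1.0)]:
            rewards = rng.normal(loc=scale, scale=0.5, size=8)
            tracker.update(gid, rewards)
            print(gid, np.round(tracker.advantages(gid, rewards), 2))
```

Because each group is normalized against its own history, a group whose rewards are small in absolute terms still yields full-scale advantages; under a shared batch baseline like batch_advantages, its gradient signal would be dominated by the higher-reward group, which is the bias the abstract attributes to standard GRPO in personalized settings.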
