
Reasoning as Action Abstractions with Scalable Mid-Training RL

Large language models excel with reinforcement learning (RL), but fully unlocking this capability requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which governs the extent to which that policy can be improved through online interaction. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally consistent latent structures with RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our method. Across all base models, RA3 improves average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline, respectively. In addition, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
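To make the mid-training loop described above concrete, here is a minimal toy sketch of an RA3-style iteration: repeatedly discover temporally consistent latent structures (here, a crude stand-in that promotes recurring token subsequences to "abstractions"), then bootstrap those abstractions back into the policy's action set. All names (`discover_abstractions`, `bootstrap_and_finetune`, the toy scoring rule) are illustrative assumptions, not the authors' implementation, which uses RL and a variational objective rather than frequency counting.

```python
from collections import Counter
from typing import Dict, List, Tuple

def discover_abstractions(traces: List[List[str]], min_len: int = 2) -> List[Tuple[str, ...]]:
    """Toy stand-in for RL-driven discovery: keep contiguous subsequences
    that recur across traces, as a proxy for temporally consistent structure."""
    counts: Counter = Counter()
    for trace in traces:
        for i in range(len(trace) - min_len + 1):
            counts[tuple(trace[i:i + min_len])] += 1
    return [seq for seq, c in counts.items() if c > 1]

def bootstrap_and_finetune(policy: Dict[str, int], abstractions) -> Dict[str, int]:
    """Toy stand-in for fine-tuning on bootstrapped data: promote discovered
    abstractions into the policy's action vocabulary, so later planning
    selects among fewer, higher-level actions over a shorter horizon."""
    new_policy = dict(policy)
    for a in abstractions:
        key = " ".join(a)
        new_policy[key] = new_policy.get(key, 0) + 1
    return new_policy

def ra3_style_midtraining(traces: List[List[str]],
                          policy: Dict[str, int],
                          iters: int = 2) -> Dict[str, int]:
    # Alternate discovery and fine-tuning, as in the iterative scheme above.
    for _ in range(iters):
        abstractions = discover_abstractions(traces)
        policy = bootstrap_and_finetune(policy, abstractions)
    return policy

traces = [["read", "parse", "loop", "emit"],
          ["read", "parse", "check", "emit"]]
policy = ra3_style_midtraining(traces, {})
print(sorted(policy))  # the recurring bigram "read parse" becomes an abstraction
```

The point of the sketch is only the control flow: shrinking the decision space into reusable abstractions before online RL, which is the regime the theoretical analysis identifies as favorable.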

