Learning Systematic Thinking with Trajectory Control

Large language models can exhibit emergent cognitive behaviors, which often surface as recurring lexical patterns (e.g., “wait,” which signals verification). However, complex thinking patterns are often underrepresented in unconstrained sampling, and conventional RL often fails to ensure that a wide variety of behaviors is discovered. We propose to systematically discover and reinforce diverse thinking patterns through systematic thinking, a paradigm that requires guided exploration of specific thinking patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning systematic thinking through trajectory control that actively guides the sampling process and encourages the exploration of diverse thinking patterns that are important for solving complex problems. The resulting behavior policy enables accurate estimation of importance-sampling weights, supporting unbiased policy optimization. We also introduce a robustness factor on the importance-sampling weights, which allows the policy to learn selectively from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments show that Ctrl-R enables effective exploration and internalization of thinking patterns that were previously unattainable, resulting in consistent improvements across language models and vision-language models on mathematical reasoning tasks.
- † University of California, Los Angeles
- ** Work done while at Apple



