RL for Reasoning by Adaptively Revealing Rationales

We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing the model to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality: it can succeed in tasks where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves otherwise intractable problems. On mathematical reasoning benchmarks (MATH, GSM8K), we find that curriculum learning enables models to solve problems that are otherwise out of reach, acquiring new reasoning skills that they could not learn without exposure to partial solutions.
- † École Polytechnique Fédérale de Lausanne (EPFL)
- * Equal contribution
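
The abstract describes AdaBack's core loop: reveal a prefix of the gold rationale, score the model's completion, and adapt the revealed length per sample from the reward signal. Below is a minimal Python sketch of that idea as stated above; the `model.generate` and `sample.check` interfaces and the fixed `step` adjustment are illustrative assumptions, not the paper's exact update rule or schedule.

```python
def adaback_step(model, sample, ratios, step=0.1):
    """One AdaBack-style update for a single training sample (sketch).

    ratios[sample.id] is the fraction of the gold rationale revealed as a
    prefix; it shrinks when the model succeeds and grows when it fails, so
    per-sample supervision fades as competence increases.
    """
    r = ratios.get(sample.id, 1.0)          # start fully supervised (SFT-like)
    cut = int(len(sample.rationale) * r)
    prefix = sample.rationale[:cut]         # revealed partial expert solution
    completion = model.generate(sample.prompt + prefix)
    reward = float(sample.check(prefix + completion))  # e.g. final-answer match

    # Adapt supervision length from past reward:
    # success -> reveal less next time; failure -> reveal more.
    if reward > 0:
        ratios[sample.id] = max(0.0, r - step)
    else:
        ratios[sample.id] = min(1.0, r + step)

    return prefix, completion, reward       # consumed by the RL update
```

When the ratio reaches 0 the sample is trained with pure RL from the bare prompt; at 1 it is effectively supervised, which is why the method occupies the intermediate regime between SFT and RL described above.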



