Balancing Accuracy and Efficiency in Language Models: A Two-Phase RL Post-Training Approach for Concise Reasoning

nimda April 11, 2025

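Recent advances in LLMs have significantly improved their reasoning abilities, particularly through RL-based fine-tuning. Initially trained with supervised learning for token prediction, these models then undergo RL post-training, exploring various reasoning paths to arrive at correct answers, much as an agent navigates a game. This process leads to emergent behaviors such as self-correction, often called the “aha moment,” where models begin revising their mistakes without explicit instruction. While this improves accuracy, it also produces much longer responses, increasing token usage, computational cost, and latency. Despite the assumption that longer outputs mean better reasoning, research shows mixed results: some improvements are seen, but overly long answers can also reduce performance, which indicates diminishing returns.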

Researchers are exploring ways to balance reasoning quality and efficiency to address this. Methods include using smaller, faster models, applying prompt engineering to reduce verbosity, and developing reward-shaping techniques that encourage concise yet effective reasoning. One notable approach is long-to-short distillation, where models learn from detailed explanations and are trained to produce shorter yet accurate answers. Using these techniques, models such as Kimi have shown competitive performance even against larger models such as GPT-4 while consuming fewer tokens. Studies also highlight the concept of “token complexity,” showing that problems require a minimum number of tokens to be solved accurately, and that prompt strategies aimed at conciseness often fall short of this optimal point. Overall, the findings emphasize the importance of developing more efficient reasoning methods without compromising performance.

Researchers from Wand AI challenge the belief that longer responses inherently lead to better reasoning in large language models. Through theoretical analysis and experiments, they show that this verbosity is a by-product of RL optimization rather than a requirement for accuracy. Interestingly, concise answers often correlate with higher accuracy, and correct responses tend to be shorter than incorrect ones. They propose a two-phase RL training approach: the first phase enhances reasoning ability, and the second enforces conciseness using a small dataset. This method reduces response length without sacrificing accuracy, offering improved efficiency and performance at minimal computational cost.
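The observation that correct responses tend to be shorter than incorrect ones is easy to check on any set of graded model outputs. The following is a minimal sketch, not the researchers' evaluation code: the function name `mean_length_by_correctness` and the sample data are hypothetical and made up for illustration.

```python
from statistics import mean

def mean_length_by_correctness(samples):
    """Average token count for correct and incorrect responses.

    `samples` is an iterable of (num_tokens, is_correct) pairs.
    Returns None for a group that has no samples.
    """
    correct = [n for n, ok in samples if ok]
    incorrect = [n for n, ok in samples if not ok]
    return {
        "correct": mean(correct) if correct else None,
        "incorrect": mean(incorrect) if incorrect else None,
    }

# Made-up example data (assumption): token counts paired with a correctness flag.
samples = [(120, True), (95, True), (310, False), (150, True), (420, False)]
stats = mean_length_by_correctness(samples)
print(stats)
```

On real data, the comparison is more informative when done per problem, since harder problems tend to draw both longer and less accurate responses.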

Longer responses do not always lead to better performance in language models. RL post-training tends to reduce response length while maintaining or improving accuracy, especially early in training. This counters the belief that long reasoning chains are necessary for correctness. The phenomenon is tied to “dead ends,” where overly long outputs risk veering off course. Analyzing language tasks as Markov decision processes reveals that RL minimizes loss, not length, and that longer outputs arise only when rewards are consistently negative. A two-phase RL strategy, first on hard problems and then on solvable ones, can boost reasoning while eventually promoting conciseness and robustness.
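One way to build intuition for why length can track the sign of the reward is a toy calculation. The sketch below is a simplification and not the paper's derivation: it assumes a single reward delivered at the final token and a discount factor `gamma` below 1, both of which are assumptions of this example, and the function names are hypothetical.

```python
def discounted_terminal_value(reward: float, length: int, gamma: float = 0.99) -> float:
    """Value seen at the first token when the only reward arrives at the last token."""
    return (gamma ** (length - 1)) * reward

def preferred_length(reward: float, short_len: int = 100, long_len: int = 1000,
                     gamma: float = 0.99) -> str:
    """Return which of two response lengths has the higher discounted value."""
    short_value = discounted_terminal_value(reward, short_len, gamma)
    long_value = discounted_terminal_value(reward, long_len, gamma)
    return "short" if short_value > long_value else "long"

# A positive reward is worth more when it arrives sooner; a negative reward
# hurts less when it is pushed further away.
for reward in (1.0, -1.0):
    print(f"reward={reward:+.1f} -> prefers {preferred_length(reward)} response")
```

Under these assumptions, a model that mostly fails has an incentive to delay the penalty by writing more, while a model that mostly succeeds has an incentive to finish early.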

The two-phase RL strategy led to notable performance gains across different model sizes. Training on varying difficulty levels showed that easier problems helped models shorten their responses while maintaining or improving accuracy. A second RL phase using just eight math problems produced more concise and robust outputs across benchmarks such as AIME, AMC, and MATH-500, with similar trends reported on other tasks. Even minimal RL post-training improved accuracy and stability under low-temperature sampling. In addition, models without prior RL refinement, such as Qwen-Math-v2.5, showed large accuracy gains of up to 30% from training on only four math problems.
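Choosing a small set of solvable problems for the second phase can be sketched as a filter on each problem's measured pass rate. This is an illustrative sketch only: the function `select_phase_two_problems`, the thresholds `min_rate` and `max_rate`, and the sample pass rates are hypothetical and do not come from the paper.

```python
def select_phase_two_problems(pass_rates, min_rate=0.1, max_rate=0.9, k=8):
    """Pick up to k problems the model solves sometimes, but not always.

    `pass_rates` maps a problem id to the fraction of sampled attempts that
    were correct. The thresholds are assumptions chosen for illustration.
    """
    eligible = [pid for pid, rate in pass_rates.items()
                if min_rate <= rate <= max_rate]
    return eligible[:k]

# Made-up pass rates (assumption): p1 is never solved, p3 is always solved.
pass_rates = {"p1": 0.0, "p2": 0.25, "p3": 1.0, "p4": 0.5, "p5": 0.75}
chosen = select_phase_two_problems(pass_rates, k=2)
print(chosen)
```

The idea is that problems the model never solves yield only negative rewards, while problems it always solves give little signal, so the middle range is the useful one for this phase.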

In conclusion, the study presents a two-phase RL post-training method that improves both reasoning and conciseness in language models. The first phase enhances accuracy, while the second focuses on shortening responses without sacrificing performance. Applied to R1 models, this approach reduced response length by more than 40% while maintaining accuracy, especially at low temperatures. The findings indicate that longer answers are not inherently better and that targeted RL can achieve concise reasoning. The study also shows that even minimal RL training can greatly benefit non-reasoning models, emphasizing the value of including moderately solvable problems and carefully tuning PPO parameters.

Check out the paper. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to real-world challenges. With a keen interest in solving practical problems, Hassan brings a fresh perspective to the intersection of AI and real-life solutions.