Researchers from Shanghai Jiao Tong University Propose OctoThinker for Reinforcement Learning-Scalable LLM Development

Introduction: Reasoning Progress with Chain-of-Thought Prompting and Reinforcement Learning
LLMs have shown impressive progress on complex reasoning tasks through chain-of-thought (CoT) prompting combined with large-scale reinforcement learning (RL). Models like DeepSeek-R1-Zero demonstrate strong reasoning capabilities by applying RL directly to base models. Similar approaches, such as Open-Reasoner-Zero, show improvements in smaller models like the Qwen series. However, achieving comparable success across different base model families remains challenging. In particular, applying R1-Zero-style training to base models such as the Llama series runs into difficulty, raising a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.
Limitations of RL Scaling on Llama Models
Large-scale RL advances in models such as o1, o3, and DeepSeek-R1 on competition-level mathematics problems have motivated exploration of RL in smaller models. However, these gains are largely limited to the Qwen model family, while replicating the results on families such as Llama proves difficult. The lack of transparency in pre-training pipelines has made it hard to understand how pre-training influences downstream RL scaling. This has motivated unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects such as MathPile, InfiMM-WebMath, and FineMath have made progress but remain limited to under 100B tokens.
Exploring Mid-Training Strategy: Stable-then-Decay
Researchers from Shanghai Jiao Tong University investigated how mid-training strategies shape RL dynamics and report several findings. First, high-quality mathematical corpora improve both base model and RL outcomes. Second, using QA-style data, especially examples with longer reasoning chains, further enhances RL results. Third, long chain-of-thought data introduces verbosity and instability into RL training. Finally, scaling up mid-training consistently yields stronger downstream RL performance. The researchers introduce a two-stage mid-training strategy called Stable-then-Decay, in which base models are first trained on 200B tokens, followed by 20B tokens across three branches, resulting in the OctoThinker models.
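The Stable-then-Decay schedule can be sketched as a simple token-budget learning-rate function. This is a minimal illustration only: the 200B-token stable phase and 20B-token decay phase come from the article, while the base learning rate, the cosine decay shape, and the minimum-LR ratio are assumed values for the sketch, not figures from the paper.

```python
import math

def stable_then_decay_lr(tokens_seen: float,
                         base_lr: float = 3e-4,        # assumed, not from the paper
                         stable_tokens: float = 200e9,  # stable phase: 200B tokens
                         decay_tokens: float = 20e9,    # decay phase: 20B tokens
                         min_lr_ratio: float = 0.1) -> float:
    """Two-stage schedule: hold a constant LR for the stable phase,
    then anneal toward a minimum LR over the decay phase."""
    if tokens_seen <= stable_tokens:
        return base_lr  # stable phase: constant learning rate
    # Decay phase: cosine-anneal from base_lr down to min_lr_ratio * base_lr.
    progress = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    min_lr = base_lr * min_lr_ratio
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In this setup, each of the three decay branches would run its own 20B-token decay stage on its respective data mixture, branching off from the shared stable-stage checkpoint.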
RL Training and Benchmark Evaluation
The researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, applied to the Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models and zero-shot prompting for RL-tuned models, across reasoning benchmarks including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models show steadily but moderately increasing response lengths, while Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that the RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.
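For reference, the reported RL hyperparameters can be gathered into a small configuration object. The class and field names are illustrative assumptions; only the numeric values and model/benchmark names follow the setup described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RLTrainConfig:
    """Illustrative container for the RL setup described in the article."""
    policy_model: str = "Llama-3.2-3B-Base"  # or "Qwen2.5-3B-Base"
    prompt_dataset: str = "MATH8K"
    global_batch_size: int = 128        # prompts per training batch
    rollouts_per_prompt: int = 16       # sampled responses per query
    ppo_mini_batch_size: int = 64
    eval_benchmarks: tuple = ("GSM8K", "MATH500", "OlympiadBench", "AMC23")

# Base models are evaluated few-shot; RL-tuned models are evaluated zero-shot.
cfg = RLTrainConfig()
```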
OctoThinker Outperforms Llama in RL Scaling
Each OctoThinker branch shows a 10%-20% improvement over the original Llama model, along with consistent gains over the stable-stage model across all sizes on the 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. Among the 3B-scale models, the hybrid and short branches show slightly lower performance, especially on challenging benchmarks.
Conclusion and Future Work: Toward RL-Ready Foundation Models
This paper investigates why base models such as Llama and Qwen display divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in the OctoThinker models. Future research directions include:
- Curating higher-quality mathematical corpora to improve mid-training.
- Creating RL-friendly base models using open recipes, without distillation from long chain-of-thought reasoning models.
- Separating the QA format and its content to understand their individual contributions.
- Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.
Check out the paper, the Hugging Face page, and the GitHub page. All credit for this study goes to the researchers of this project.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he explores practical AI applications, focusing on understanding the impact of AI technologies and their real-world implications. He aims to explain complex AI concepts in a clear and accessible manner.




